Abstract

We propose a new model for learning bilingual word representations from non-parallel 
document-aligned data. Following the recent advances in word representation learning, our 
model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). 
Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora
and/or readily available translation resources such as dictionaries, this article shows
that BWEs may be learned solely on the basis of document-aligned comparable data, without
any additional lexical resources or syntactic information. We present a comparison
of our approach with previous state-of-the-art models for learning bilingual word representations
from comparable data that rely on the framework of multilingual probabilistic
topic modeling (MuPTM), as well as with distributional local context-counting models. We
demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon
extraction, and (2) suggesting word translations in context for polysemous words. Our simple
yet effective BWE-based models significantly outperform the MuPTM-based and context-counting
representation models from comparable data as well as prior BWE-based models,
and achieve the best reported results on both tasks for all three tested language pairs.


1. Introduction 

A huge body of work in distributional semantics and word representation learning almost 
exclusively revolves around the distributional hypothesis (Harris, 1954) - an idea which 
states that similar words occur in similar contexts. All current corpus-based approaches to 
semantics rely on the contextual evidence in one way or another. Roughly speaking, word 
representations are typically learned using these two families of distributional context-based 
models: (1) global matrix factorization models such as latent semantic analysis (LSA) 
(Landauer & Dumais, 1997) or generative probabilistic models such as latent Dirichlet
allocation (LDA) (Blei, Ng, & Jordan, 2003), which model the word co-occurrence at the
document or paragraph level; or (2) local context window models that represent words as
sparse high-dimensional context vectors, and model the word co-occurrence at the level of
selected neighboring words (Turney & Pantel, 2010), or generative probabilistic models that
learn the probability distribution of a vocabulary word in the context window as a latent
variable (Deschacht & Moens, 2009; Deschacht, De Belder, & Moens, 2012).

On the other hand, dense real-valued vectors known as distributed representations of 
words or word embeddings (WEs) (e.g., Bengio, Ducharme, Vincent, & Janvin, 2003; Collobert
& Weston, 2008; Mikolov, Chen, Corrado, & Dean, 2013a; Pennington, Socher, &
Manning, 2014) have been introduced recently, first as part of neural network based architectures
for statistical language modeling. WEs serve as richer and more coherent word
representations than the ones obtained by the aforementioned traditional distributional
semantic models, with illustrative comparative studies available in recently published
work (e.g., Mikolov, Yih, & Zweig, 2013d; Baroni, Dinu, & Kruszewski, 2014; Levy,
Goldberg, & Dagan, 2015).

A natural extension of interest from monolingual to multilingual word embeddings has 
occurred recently (e.g., Klementiev, Titov, & Bhattarai, 2012; Hermann & Blunsom, 2014b). 
When operating in multilingual settings, it is highly desirable to learn embeddings for words 
denoting similar concepts that are very close in the shared bilingual embedding space (e.g., 
the representations for the English word school and the Spanish word escuela should be 
very similar). These BWEs may then be used in a myriad of multilingual natural language 
processing tasks and beyond, such as fundamental tasks leaning on such bilingual meaning 
representations, e.g., computing cross-lingual and multilingual semantic word similarity and 
extracting bilingual word lexicons using the induced bilingual embedding space (see Figure 1).
However, all these models critically require (at least) sentence-aligned parallel data
and/or readily available translation dictionaries to induce bilingual word embeddings (BWEs)
that are consistent and closely aligned over different languages. 

Contributions. To the best of our knowledge, this article presents the first work to 
showcase that bilingual word embeddings may be induced directly on the basis of comparable
data without any additional bilingual resources such as sentence-aligned parallel data
or translation dictionaries. The focus is on document-aligned comparable corpora (e.g., 
Wikipedia articles aligned through inter-wiki links, news texts discussing the same theme). 

Our new bilingual distributed representation learning model makes use of pseudo-bilingual 
documents constructed by merging the content of two coupled documents from a document 
pair, where we propose and evaluate two different strategies on how to construct such 
pseudo-bilingual documents: (1) the merge and randomly shuffle strategy, which randomly permutes
words from both languages in each pseudo-bilingual document, and (2) the length-ratio
shuffle strategy, a deterministic method that retains monolingual word order while intermingling
the words cross-lingually. These additional pre-training shuffling strategies ensure that
both source language words and target language words occur in the contexts of each source
and target language word. A monolingual model such as skip-gram with negative sampling
(SGNS) from the word2vec package (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013c)
is then trained on these “shuffled” pseudo-bilingual documents. By this procedure, we steer
semantically similar words from different languages towards similar representations in the 
shared bilingual embedding space, and effectively use available bilingual contexts instead of 
monolingual ones. The model treats documents as bags-of-words (i.e., it does not include 
any syntactic information) and does not even rely on any sentence boundary information. 

In summary, the main contributions of this article are: 





Bilingual Distributed Word Representations from Document-Aligned Data 


(1) We present BWE Skip-Gram (BWESG), the first model that induces bilingual word 
embeddings directly from document-aligned non-parallel data. We test and evaluate two 
main variants of the model based on the pre-training shuffling step. The main strength of 
the presented model lies in its favourable trade-off between simplicity and effectiveness. 

(2) We provide a qualitative and quantitative analysis of the model. We draw analogies and 
comparisons with prior work on inducing word representations from the same data type: 
document-aligned comparable corpora (e.g., models relying on the multilingual probabilistic 
topic modeling framework (MuPTM)). 

(3) We demonstrate the utility of induced BWEs at the word type level in the task of bilingual
lexicon extraction (BLE) from Wikipedia data for three language pairs. A BLE model
based on our BWEs significantly outperforms MuPTM-based and context-counting BLE
models, and achieves the best reported scores on the benchmarking BLE datasets.

(4) We demonstrate the utility of induced BWEs at the word token level in the task of 
suggesting word translations in context (SWTC) (Vulic & Moens, 2014) for the same three 
language pairs. An SWTC model based on our BWEs again significantly outscores the best
scoring MuPTM-based SWTC models in the same setting without any use of parallel data
and translation dictionaries, and again achieves the best reported results on the benchmarking
SWTC datasets.

(5) We also present a comparison with state-of-the-art BWE induction models (Mikolov, 
Le, & Sutskever, 2013b; Hermann & Blunsom, 2014b; Gouws, Bengio, & Corrado, 2015) 
in BLE and SWTC. Results reveal that our simple yet effective approach is on par with
or outperforms other BWE induction models that rely on parallel data or readily available
dictionaries to learn shared bilingual embedding spaces. In addition, preliminary experiments
with BWESG on parallel Europarl data demonstrate that the model is also useful
when trained on sentence-aligned data, reaching the performance of benchmarking BWE
induction models from parallel data (e.g., Hermann & Blunsom, 2014b).

2. Related Work 

In this section we further motivate why we opt for building a model for inducing bilingual 
word embeddings from comparable document-aligned data. For a clearer overview, we have
split related work into three broad clusters: (1) monolingual word embeddings, (2) bilingual 
word embeddings, and (3) bilingual word representations from document-aligned data. 

2.1 Monolingual Word Embeddings 

The idea of representing words as continuous real-valued vectors dates back to the mid-1980s
(Rumelhart, Hinton, & Williams, 1986; Elman, 1990). The idea met its resurgence a
decade ago (Bengio et al., 2003), where a neural language model learns word embeddings as
part of a neural network architecture for statistical language modeling. This work inspired 
other approaches that learn word embeddings within the neural-network language modeling 
framework (Collobert & Weston, 2008; Collobert, Weston, Bottou, Karlen, Kavukcuoglu, & 
Kuksa, 2011). Word embeddings are tailored to capture semantics and encode a continuous 
notion of semantic similarity (as opposed to semantically poorer discrete representations), 
necessary to share information between words and other text units. 


Figure 1: A toy 3D shared bilingual embedding space from Gouws et al. (2015): While in 
monolingual spaces words with similar meanings should have similar representations,
in bilingual spaces words in two different languages with similar meanings
should have similar representations (both mono- and cross-lingually). 


Recently, the skip-gram and continuous bag-of-words (CBOW) model from Mikolov et 
al. (2013a, 2013c) revealed that the full neural-network structure is not needed at all to 
learn high-quality word embeddings (with extremely decreased training times compared to 
the full-fledged neural network models, see Mikolov et al.’s (2013a) work for the full analysis 
of complexity of the models). These models are in fact simple single-layered architectures, 
where the objective is to predict a word’s context given the word itself (skip-gram) or 
predict a word given its context (CBOW). Similar models called vector log-bilinear models 
were recently proposed (Mnih & Kavukcuoglu, 2013). Other models inspired by skip-gram 
and CBOW are GloVe (Global Vectors for Word Representation) (Pennington et al., 2014),
which combines local and global contexts of a word into a unified model, and a model 
which relies on dependency-based contexts instead of simpler word-based contexts (Levy & 
Goldberg, 2014a), and new models are steadily emerging (e.g., Lebret & Collobert, 2014; 
Lu, Wang, Bansal, Gimpel, & Livescu, 2015; Stratos, Collins, & Hsu, 2015; Trask, Gilmore,
& Russell, 2015; Liu, Jiang, Wei, Ling, & Hu, 2015).

An interesting finding has been discussed recently (Levy & Goldberg, 2014b): the popular
skip-gram model with negative sampling (SGNS) (Goldberg & Levy, 2014) is simply a
model which implicitly factorizes a word-context matrix, with its cells containing pointwise
mutual information (PMI) scores of the respective word and context pairs, shifted by a
global constant. In other words, SGNS does exactly the same thing as traditional
distributional models (i.e., context counting plus context weighting and/or dimensionality
reduction), with a slight improvement in performance with SGNS (Baroni et al., 2014; Levy
et al., 2015).
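
For reference, the factorization result can be stated compactly. The notation below is an assumption introduced here for illustration and is not used elsewhere in this article: $k$ is the number of negative samples drawn per observed pair, $\#(w, c)$ is the co-occurrence count of word $w$ and context $c$, $\#(w)$ and $\#(c)$ are their marginal counts, and $|D|$ is the total number of observed pairs. Under these assumptions, and with sufficiently high embedding dimensionality, the SGNS optimum analysed by Levy and Goldberg (2014b) takes the form:

```latex
\vec{w} \cdot \vec{c} \;=\; \mathrm{PMI}(w, c) - \log k,
\qquad
\mathrm{PMI}(w, c) \;=\; \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}
```

In this notation, the global constant by which the PMI scores are shifted is $\log k$.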

All these low-dimensional vectors, besides improving computational efficiency, lead to
better generalizations, even allowing generalization over the vocabularies observed in labelled
data, and hence partially alleviating the ubiquitous problem of data sparsity. Their utility
has been validated in various semantic tasks such as semantic word similarity,
synonymy detection or word analogy solving (Mikolov et al., 2013d; Baroni et al., 2014;
Pennington et al., 2014). Moreover, word embeddings have been proven to serve as useful
unsupervised features for plenty of downstream NLP tasks such as named entity recognition,
chunking, semantic role labeling, part-of-speech tagging, parsing, and selectional preferences
(Turian, Ratinov, & Bengio, 2010; Collobert et al., 2011; Chen & Manning, 2014).

Due to its simplicity, as well as its efficacy and consequent popularity in various tasks
(Mikolov et al., 2013c; Levy & Goldberg, 2014b), with a clear advantage on similarity tasks
when compared to traditional models from distributional semantics (Levy et al., 2015), in
this article we focus on the adaptation of SGNS (Mikolov et al., 2013c). In Section 3,
we provide a very brief overview of the model, and then follow up with our new bilingual 
model which is based on SGNS. 

2.2 Bilingual Word Embeddings 

Bilingual word representations could serve as a useful source of knowledge for problems in
cross-lingual information retrieval (Levow, Oard, & Resnik, 2005; Vulic, De Smet, & Moens,
2013), statistical machine translation (Wu, Wang, & Zong, 2008), document classification
(Ni, Sun, Hu, & Chen, 2011; Klementiev et al., 2012; Hermann & Blunsom, 2014b; Chandar,
Lauly, Larochelle, Khapra, Ravindran, Raykar, & Saha, 2014; Vulic, De Smet, Tang,
& Moens, 2015), bilingual lexicon extraction (Tamura, Watanabe, & Sumita, 2012; Vulic
& Moens, 2013a), or knowledge transfer and annotation projection from resource-rich to
resource-poor languages for a myriad of NLP tasks such as dependency parsing, POS tagging,
semantic role labeling or selectional preferences (Yarowsky & Ngai, 2001; Pado &
Lapata, 2009; Peirsman & Pado, 2010; Das & Petrov, 2011; Tackstrom, Das, Petrov, McDonald,
& Nivre, 2013; Ganchev & Das, 2013; Tiedemann, Agic, & Nivre, 2014; Xiao
& Guo, 2014). Other interesting application domains are machine translation (e.g., Zou,
Socher, Cer, & Manning, 2013; Wu, Dong, Hu, Yu, He, Wu, Wang, & Liu, 2014; Zhang, Liu,
Li, Zhou, & Zong, 2014) and cross-lingual information retrieval (e.g., Vulic & Moens, 2015).
Moreover, by making the transition from monolingual to bilingual settings and building a
shared bilingual embedding space (see again Figure 1 for an illustrative example), one is able
to extend or rather generalize semantic tasks such as semantic similarity computation, synonymy
detection or word analogy computation across languages. Following the success in
monolingual settings, a body of recent work on word representation learning has therefore 
focused on learning bilingual word embeddings (BWEs). 

The current research on inducing BWEs critically relies on sentence-aligned parallel 
data or readily available bilingual lexicons to achieve the coherence of representations across 
languages (e.g., to build similar representations for similar concepts in different languages 
such as January-januari, dog-hund or sky-hemel). We may cluster the current work in three 
different groups: (1) the models that rely on hard word alignments obtained from parallel 
data to constrain the learning of BWEs (Klementiev et al., 2012; Zou et al., 2013; Wu et al.,
2014); (2) the models that use the alignment of parallel data at the sentence level (Kocisky,
Hermann, & Blunsom, 2014; Hermann & Blunsom, 2014a, 2014b; Chandar et al., 2014; Shi,
Liu, Liu, & Sun, 2015; Gouws et al., 2015); (3) the models that critically require readily
available bilingual lexicons (Mikolov et al., 2013b; Faruqui & Dyer, 2014; Xiao & Guo,
2014). The main disadvantage of all these models is the limited availability of parallel data
and bilingual lexicons, resources which are scarce and/or domain-restricted for plenty of
language pairs. In this work, we significantly alleviate the requirements: unlike prior work,
we show that BWEs may be induced solely on the basis of document-aligned comparable
data without any additional need for parallel data or bilingual lexicons. Note that (in
theory) the work from Hermann and Blunsom (2014b) and Chandar et al. (2014) may also be
extended to the same setting with document-aligned data, as these two models originally 
rely on sentence embeddings computed as aggregations over their single word embeddings 
plus sentence alignments. In this work, by testing and comparing to the BiCVM model from 
Hermann and Blunsom (2014b), we show that these models do not work well in practice 
after replacing the very strong bilingual signal coded in parallel sentences with the noisy 
bilingual signal given by document alignments and non-parallel data. 


2.3 Bilingual Word Representations from Document-Aligned Data 

Prior work on inducing bilingual word representations in the early days followed the tradition
of window-based context-counting distributional models (Rapp, 1999; Gaussier, Renders,
Matveeva, Goutte, & Dejean, 2004; Laroche & Langlais, 2010) and it again required
a bilingual lexicon as a critical resource. In order to tackle this issue, recent work relies on
the supervision-lighter framework of multilingual probabilistic topic modeling (MuPTM)
(Mimno, Wallach, Naradowsky, Smith, & McCallum, 2009; Boyd-Graber & Blei, 2009;
De Smet & Moens, 2009; Ni, Sun, Hu, & Chen, 2009; Zhang, Mei, & Zhai, 2010; Fukumasu,
Eguchi, & Xing, 2012) or other similar models for latent structure induction (Haghighi,
Liang, Berg-Kirkpatrick, & Klein, 2008; Daumé III & Jagarlamudi, 2011).

Words in this setting are represented as real-valued vectors with conditional topic probability
scores $P(z_k \mid w_i)$, regardless of their actual language. Topics $z_k$ are in fact latent
inter-lingual concepts discovered directly from multilingual comparable data using a multilingual
topic model such as bilingual LDA. We discuss the MuPTM-based representations
in more detail in Section 4.1. 

MuPTM-based bilingual word representations induced from comparable data have demonstrated
their utility in tasks such as cross-lingual semantic similarity computation and bilingual
lexicon extraction (Vulic, De Smet, & Moens, 2011; Liu, Duh, & Matsumoto, 2013) and 
suggesting word translations in context (Vulic & Moens, 2014). In this work, we compare 
the state-of-the-art MuPTM-based word representations induced from the same type of 
comparable corpora with BWEs learned by our new model in these two semantic tasks. 

Another recent model (Søgaard, Agic, Martinez Alonso, Plank, Bohnet, & Johannsen,
2015) is also able to learn from document-aligned data. It is a count-based model which
builds binary word vectors denoting the occurrence of each word in each document pair. 
Dimensionality reduction is then applied post-hoc on the induced sparse vectors. Since the 
links between documents are known, the model is able to learn cross-lingual correspondences 
between words and, consequently, bilingual word representations. Exactly the same idea 
was already introduced as a baseline model by Vulic et al. (2011), where TF-IDF weights 
were used instead of binary indices, and no dimensionality reduction was applied post-hoc. 
The model from Vulic et al. (2011) was surpassed by baseline models from document-aligned 
data briefly discussed in Section 4.1, while the model from Søgaard et al. (2015) obtains
results that are very similar to the BWE baselines compared against in this work (described 
in Section 4.2). 
3. BWESG: Model Architecture

Our new bilingual model is an extension of SGNS to bilingual settings with document-aligned
comparable training data. This section describes the underlying SGNS and two
variants of our SGNS-based BWE induction model.


3.1 Skip-Gram with Negative Sampling (SGNS) 

Our departure point is the log-linear SGNS from Mikolov et al. (2013c) as implemented in
the word2vec package.¹ The SGNS model learns word embeddings (WEs) in a similar way
to neural language models (Bengio et al., 2003; Collobert & Weston, 2008), but without a
non-linear hidden layer.

In the monolingual setting, we assume one language $L$ with vocabulary $V$, and a corpus
of words $w \in V$, along with their contexts $c \in V^c$, where $V^c$ is the context vocabulary.
Contexts for each word $w_n$ are typically neighboring words in a context window of size $cs$
(i.e., $w_{n-cs}, \ldots, w_{n-1}, w_{n+1}, \ldots, w_{n+cs}$), so effectively it holds $V^c = V$.²

Each word type $w \in V$ is associated with a vector $\vec{w} \in \mathbb{R}^d$ (its pivot word representation
or pivot word embedding, see Figure 2), and a vector $\vec{w}_c \in \mathbb{R}^d$ (its context embedding).
$d$ is the dimensionality of the WE vectors, which, as a model input parameter, has to be
set in advance before the training procedure commences. The entries in these vectors are
latent, and treated as parameters $\theta$ to be learned by the model. In short, the idea of the
skip-gram model is to scan through the corpus (which is typically unannotated, Mikolov
et al., 2013a) word by word in turn (i.e., these are the pivot words), and learn from the pairs
(word, context word). The learning goal is to maximize the ability of predicting context
words for each pivot word in the corpus. Let $ob = 1$ denote that the pair of words $(w, v)$ is
observed in the corpus and thus belongs to the training set $D$. The probability of $(w, v) \in D$
is defined by the softmax function:


$$P(ob = 1 \mid w, v, \theta) = \frac{1}{1 + \exp(-\vec{w} \cdot \vec{v}_c)} \qquad (1)$$


Each word token $w$ in the corpus is treated in turn as the pivot and all pairs of word tokens
$(w, w \pm 1), \ldots, (w, w \pm t(cs))$ are appended to $D$, where $t(cs)$ is an integer sampled from a
uniform distribution on $\{1, \ldots, cs\}$.³ The global training objective $J$ is then to maximize
the probabilities that all pairs from $D$ are indeed observed in the corpus:


$$J = \arg\max_{\theta} \sum_{(w,v) \in D} \log \frac{1}{1 + \exp(-\vec{w} \cdot \vec{v}_c)} \qquad (2)$$


where $\theta$ are the parameters of the model, that is, pivot and context word embeddings which
have to be learned. One may see that this objective function has a trivial solution by setting
$\vec{w} = \vec{v}_c$ and $\vec{w} \cdot \vec{v}_c = Val$, where $Val$ is a large enough number (Goldberg & Levy, 2014). In
order to prevent this trivial training scenario, the negative sampling procedure comes into
the picture (Collobert & Weston, 2008; Mikolov et al., 2013c).

1. https://code.google.com/p/word2vec/

2. Testing other options for context selection such as dependency-based contexts (Levy & Goldberg, 2014a)
is beyond the scope of this work, and it was shown that these contexts may not lead to any gains in the
final WEs (Kiela & Bottou, 2014).

3. The original skip-gram model utilizes dynamic window sizes, where cs denotes the maximum window
size. Moreover, the model takes into account sentence boundaries in context selection, that is, it selects
as context words only words occurring in the same sentence as the pivot word.
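
The construction of the training set $D$ with dynamic window sizes can be sketched as follows. This is a minimal illustration under stated assumptions, not the word2vec implementation: the function name `skipgram_pairs`, the toy token list, and the seeded random generator are hypothetical, and the token list is treated as a single unit with no sentence boundaries.

```python
import random


def skipgram_pairs(tokens, cs, rng=None):
    """Collect (pivot, context) pairs using a dynamic window of maximum size cs."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    pairs = []
    for n, pivot in enumerate(tokens):
        t = rng.randint(1, cs)  # t(cs): integer drawn uniformly from {1, ..., cs}
        lo = max(0, n - t)
        hi = min(len(tokens), n + t + 1)
        for m in range(lo, hi):
            if m != n:  # the pivot is not its own context
                pairs.append((pivot, tokens[m]))
    return pairs


# Hypothetical toy input, invented for illustration.
tokens = ["the", "ring", "was", "found", "in", "mordor"]
training_set = skipgram_pairs(tokens, cs=2)
```

Because $t(cs)$ is resampled for every pivot, nearby words are paired with the pivot more often than distant ones, even though every pair receives the same weight in the objective.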

In short, the idea behind negative sampling is to present the model with a set $D'$
of artificially created or sampled “negative pivot-context” word pairs $(w, v')$, which by
assumption serve as negative examples, that is, they do not occur as observed/positive
(word, context) pairs in the training corpus. The model then has to adjust the parameters $\theta$
in such a way to also maximize the probability that these negative pairs will not occur in the
corpus. While the interested reader may find further details about the negative sampling 
procedure, and the new exact objective function along with its derivation elsewhere (Levy & 
Goldberg, 2014b), for illustrative purposes and simplicity, here we present the approximative 
objective function with negative sampling by Goldberg and Levy (2014): 


$$J = \arg\max_{\theta} \sum_{(w,v) \in D} \log \frac{1}{1 + \exp(-\vec{w} \cdot \vec{v}_c)} + \sum_{(w,v') \in D'} \log \frac{1}{1 + \exp(\vec{w} \cdot \vec{v}'_c)} \qquad (3)$$


The free parameters $\theta$ are updated using stochastic gradient descent and backpropagation,
with learning rate typically controlled by Adagrad (Duchi, Hazan, & Singer, 2011) or with
a global linearly decreasing learning rate. By optimizing the objective from eq. (3), the 
model incrementally pushes observed pivot WEs towards context WEs of their collocates in 
the corpus. In the words of distributional hypothesis - after training, words that occur in 
similar contexts should end up having similar word embeddings. In other words, to link the 
terminology of distributional hypothesis and the modeling assumptions of SGNS - words 
that predict similar contexts end up having similar word embeddings. 
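
To make the objective in eq. (3) concrete, the sketch below evaluates it for given embeddings and performs one stochastic gradient step per training pair. It is an illustration under stated assumptions rather than the word2vec code: the function names, the two-dimensional toy vectors, the word labels, and the fixed learning rate are all hypothetical, and pivot and context embeddings are stored in two plain dictionaries.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """log(1 / (1 + exp(-x))), written to avoid overflow for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_objective(pivot, ctx, positives, negatives):
    """Value of the sum in eq. (3): positives play the role of D, negatives of D'."""
    pos = sum(log_sigmoid(dot(pivot[w], ctx[v])) for w, v in positives)
    neg = sum(log_sigmoid(-dot(pivot[w], ctx[v])) for w, v in negatives)
    return pos + neg


def sgd_step(pivot, ctx, w, v, label, lr=0.1):
    """One gradient ascent step on a single pair; label is 1 for D and 0 for D'."""
    score = dot(pivot[w], ctx[v])
    g = label - 1.0 / (1.0 + math.exp(-score))  # derivative of the pair's log term w.r.t. score
    old_w = list(pivot[w])
    pivot[w] = [a + lr * g * b for a, b in zip(pivot[w], ctx[v])]
    ctx[v] = [b + lr * g * a for a, b in zip(old_w, ctx[v])]


# Hypothetical toy embeddings, invented for illustration.
pivot = {"school_EN": [0.1, 0.0]}
ctx = {"escuela_ES": [0.0, 0.1], "anillo_ES": [0.1, 0.1]}
value = sgns_objective(pivot, ctx,
                       [("school_EN", "escuela_ES")],
                       [("school_EN", "anillo_ES")])
```

A positive pair pulls the pivot vector towards the context vector of its collocate, while a negative pair pushes the two apart, which is the behaviour described in the paragraph above.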


3.2 Final Model - BWESG: BWE Skip-Gram 

In the next step, we propose a novel method that extends SGNS to work with bilingual 
document-aligned comparable data. Let us assume that we possess a document-aligned 
comparable corpus, defined as $C = \{d_1, d_2, \ldots, d_N\} = \{(d_1^S, d_1^T), (d_2^S, d_2^T), \ldots, (d_N^S, d_N^T)\}$.
$d_j = (d_j^S, d_j^T)$ denotes a pair of aligned documents in the source language $L_S$ and the
target language $L_T$ respectively, and $N$ is the number of pairs in the corpus. $V^S$ and $V^T$
are vocabularies associated with languages $L_S$ and $L_T$. The goal is to learn a shared
bilingual embedding space given the data (Figure 1) and document alignments as the only
bilingual signal during training. We present two strategies that, coupled with SGNS, lead
to such shared bilingual spaces. An overview of the architecture for learning BWEs from
document-aligned comparable data with the two strategies is given in Figures 2(a) and 2(b).

Figure 2: The architecture of our BWE Skip-Gram (BWESG) model for learning bilingual
word embeddings from document-aligned comparable data with two different pre-training
strategies: (1) non-deterministic merge and shuffle, (2) deterministic
length-ratio shuffle. Source language words and documents are drawn as gray
boxes, while target language words and documents are drawn as blue boxes. The
right side of the figures (separated by vertical dashed lines) illustrates how a
pseudo-bilingual document is constructed from a pair of two aligned documents.
[Panel labels: input, pivot word representation; output, context representations; aligned document pair. Panel (a): Merge and Shuffle.]

(1) Merge and Shuffle. In the first step, we merge two documents $d_j^S$ and $d_j^T$ from
the aligned document pair $d_j$ into a single “pseudo-bilingual” document $d'_j$. Following
that, we randomly shuffle the newly constructed pseudo-bilingual document. A shuffle is
a (random) permutation of the word tokens given in two different languages forming the
pseudo-bilingual document. The pre-training shuffling step (see Figure 2(a)) ensures that
each word $w$, regardless of its actual language, obtains word collocates from both vocabularies.
The idea of obtaining bilingual contexts for each pivot word in each pseudo-bilingual
document will steer the final model towards constructing a shared bilingual space. Since the
model depends on the alignment at the document level, in order to ensure the bilingual contexts
instead of monolingual contexts, it is intuitive to assume that larger window sizes will
lead to better bilingual embeddings. We test this hypothesis and the effect of window size
in Section 7.3. In another interpretation, since the model relies only on (pseudo-bilingual)
document level co-occurrence, the window size parameter then just controls the amount of
random data dropout, that is, the number of positive document-level training examples.
The locality feature of SGNS is not preserved due to the shuffling procedure.
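
The merge and shuffle step can be sketched in a few lines. The sketch makes two assumptions that are labeled here because they are not prescribed by the text: tokens are tagged with a language suffix so that identical strings in the two languages remain distinct, and an optional seed is exposed so that a particular shuffle can be reproduced. The function name and the toy documents are hypothetical.

```python
import random


def merge_and_shuffle(src_doc, tgt_doc, src_lang="EN", tgt_lang="ES", seed=None):
    """Build one pseudo-bilingual document from an aligned document pair."""
    # Assumption: suffix each token with its language code to keep vocabularies apart.
    pseudo = [f"{w}_{src_lang}" for w in src_doc]
    pseudo += [f"{w}_{tgt_lang}" for w in tgt_doc]
    # A shuffle is a random permutation of all tokens from both languages.
    random.Random(seed).shuffle(pseudo)
    return pseudo


# Hypothetical toy aligned pair, invented for illustration.
en_doc = ["Frodo", "Sam", "orcs", "goblins", "Mordor", "ring"]
es_doc = ["anillo", "orcos", "mago"]
pseudo_doc = merge_and_shuffle(en_doc, es_doc, seed=13)
```

Each pseudo-bilingual document produced this way would then be passed as one training unit to a monolingual SGNS implementation, typically with a large window size for the reasons given above.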

VULIC AND MOENS

(2) Length-Ratio Shuffle. The non-deterministic and uncontrollable nature of the merge
and shuffle procedure opens up a possibility of accidentally obtaining “bad shuffles” that
will result in sub-optimal word representations. Therefore, we also propose a deterministic
strategy for building pseudo-bilingual documents suitable for bilingual training. Source and
target language words are inserted into an (initially empty) pseudo-bilingual document in
turn based on the ratio of document lengths, with word order preserved. Document lengths
are measured in terms of word tokens, and let us denote them as $m_S$ and $m_T$ for an aligned
document pair $(d_j^S, d_j^T)$. Let us assume, without loss of generality, that $m_S \geq m_T$. The
procedure then proceeds as follows (if $m_T > m_S$ the procedure proceeds in an analogous
manner with the roles of $d_j^S$ and $d_j^T$ reversed):

1. Pseudo-bilingual document $d'_j$ is empty: $d'_j = \{\}$.

2. Compute the ratio: $R = \lfloor m_S / m_T \rfloor$.

3. Scan through aligned documents $d_j^S$ and $d_j^T$ simultaneously and (3.1) append $R$ word
tokens from $d_j^S$ into $d'_j$; then (3.2) append 1 word token from $d_j^T$. Repeat steps 3.1
and 3.2 until all word tokens from $d_j^T$ have been inserted into $d'_j$.

4. Insert remaining $m_S \bmod m_T$ word tokens from $d_j^S$ into $d'_j$.

Using a simple example, assume that we have an English (EN) document {Frodo, Sam, orcs,
goblins, Mordor, ring} and a Spanish (ES) document {anillo, orcos, mago}: the pseudo-bilingual
document would be formed by inserting 1 Spanish word after 2 English words (as
the length ratio is 6:3 = 2:1). The final pseudo-bilingual document is:

{Frodo_EN, Sam_EN, anillo_ES, orcs_EN, goblins_EN, orcos_ES, Mordor_EN, ring_EN, mago_ES}.
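
The four steps of the length-ratio shuffle, applied to the example above, can be sketched as follows. The function name is hypothetical, tokens are assumed to be already tagged with their language, and the handling of an empty document is an assumption added only so that the sketch does not divide by zero.

```python
def length_ratio_shuffle(src_doc, tgt_doc):
    """Deterministically interleave two aligned documents, preserving word order."""
    # If the second document is longer, reverse the roles of the two documents.
    if len(src_doc) < len(tgt_doc):
        return length_ratio_shuffle(tgt_doc, src_doc)
    if not tgt_doc:  # assumption: nothing to interleave with
        return list(src_doc)
    ratio = len(src_doc) // len(tgt_doc)  # step 2: R = floor(m_S / m_T)
    pseudo = []  # step 1: start from an empty document
    for i, tgt_token in enumerate(tgt_doc):  # step 3
        pseudo.extend(src_doc[i * ratio:(i + 1) * ratio])  # 3.1: R tokens from the longer document
        pseudo.append(tgt_token)  # 3.2: one token from the shorter document
    pseudo.extend(src_doc[ratio * len(tgt_doc):])  # step 4: remaining m_S mod m_T tokens
    return pseudo


en_doc = ["Frodo_EN", "Sam_EN", "orcs_EN", "goblins_EN", "Mordor_EN", "ring_EN"]
es_doc = ["anillo_ES", "orcos_ES", "mago_ES"]
pseudo_doc = length_ratio_shuffle(en_doc, es_doc)
```

For this input the ratio is 2, so the result places one Spanish token after every two English tokens and reproduces the pseudo-bilingual document given in the example.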

In another interpretation, the length-ratio shuffle strategy constructs a single permutation/shuffle
of the pseudo-bilingual document controlled by the word order in two aligned
documents as well as their length ratio. As before, the model relies on pseudo-bilingual 
document level co-occurrence, and the window size parameter controls the amount of (now 
non-random) data dropout. A difference lies in the fact that this procedure now keeps word 
order intact monolingually while constructing a pseudo-bilingual document. 

The final BWE Skip-gram (BWESG) model then relies on the monolingual variant of
SGNS (or any other monolingual WE induction model) trained on these shuffled/permuted
pseudo-bilingual documents (using strategies (1) or (2)).⁴ The model learns word embeddings
for source and target language words aligned over the $d$ shared embedding dimensions.
The BWESG-based representation of word $w$, regardless of its actual language, is then a
$d$-dimensional vector: $\vec{w} = [f_1, \ldots, f_k, \ldots, f_d]$. $f_k \in \mathbb{R}$ denotes the score for the $k$-th shared
inter-lingual feature within the $d$-dimensional shared bilingual embedding space. Since all
words share the embedding space, semantic similarity between words may be computed 
both monolingually and across languages. We will extensively use this property in our 
evaluation tasks. 
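
As an illustration of how a shared space supports cross-lingual comparison, the sketch below retrieves the most similar target language word for a source language word. Everything specific in it is an assumption made for illustration: cosine similarity is used as the similarity measure, the three-dimensional vectors are invented rather than learned, and words are told apart by a hypothetical language suffix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def nearest_target_word(word, embeddings, target_suffix="_ES"):
    """Return the target language word whose vector is closest to that of `word`."""
    candidates = [w for w in embeddings if w.endswith(target_suffix)]
    return max(candidates, key=lambda c: cosine(embeddings[word], embeddings[c]))


# Invented toy vectors standing in for learned d-dimensional embeddings.
toy_space = {
    "school_EN": [0.9, 0.1, 0.0],
    "escuela_ES": [0.8, 0.2, 0.1],
    "anillo_ES": [0.0, 0.1, 0.9],
}
best = nearest_target_word("school_EN", toy_space)
```

Restricting the candidate set to words of the same language instead would give a monolingual similarity search over the same vectors.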

4. Baseline Representation Models 

We quickly navigate through other approaches to bilingual word representation learning
from document-aligned comparable data. The set of models in comparison may be roughly
clustered into two main groups: (Group I) “pre-BWE” baseline representation models from
document-aligned data, (Group II) benchmarking BWE induction models that were not
originally developed for learning from document-aligned comparable data. While it is essential
to compare the BWESG model with other frameworks for learning representations
from document-aligned data (Group I), it is also crucial to detect main strengths of the
BWESG model when compared to other approaches in the BWE learning framework which
can also be adjusted to learn from document-aligned data (Group II).

4. We also experimented with GloVe and CBOW, but they fell short of SGNS on average.

4.1 Group I: Baseline Representation Models from Document-Aligned Data 

Basic-MuPTM The early approaches (e.g., Dumais, Landauer, & Littman, 1996; Carbonell, 
Yang, Frederking, Brown, Geng, & Lee, 1997) tried to 
mine topical structure from document-aligned comparable texts using a monolingual topic 
model (e.g., LSA or LDA) trained on pseudo-bilingual documents with the target document 
simply appended to its source language counterpart, and then used the discovered latent 
topical structure as a shared semantic space in which both words and documents from two 
languages may be represented in a uniform way. 

More recent work on multilingual probabilistic topic modeling (MuPTM) (Mimno et al., 
2009; De Smet & Moens, 2009; Vulic et al., 2011) showed that word representations of higher 
quality may be built if a multilingual topic model such as bilingual LDA (BiLDA) is trained 
jointly on document-aligned comparable corpora by retaining the structure of the corpus 
intact (i.e., there is no need to construct pseudo-bilingual documents). 

MuPTM discovers the latent structure of the observed data in the form of K latent 
cross-lingual topics $z_1, \ldots, z_K$ which optimally describe the generation of observed data. 
Extracting latent cross-lingual topics actually implies learning per-document topic 
distributions for each document in the corpus (probability scores $P(z_k|d_j)$), and discovering 
language-specific representations of these topics given by per-topic word distributions in 
each language (probability scores $P(w_i^S|z_k)$ and $P(w_j^T|z_k)$). Latent cross-lingual topics are 
in fact distributions over vocabulary words, and have their language-specific representation 
in each language. Per-document topic distributions and per-topic word distributions are 
obtained after training the topic model on multilingual data. The representation of some 
word $w \in V^S$ (or in an analogous manner $w \in V^T$) is then a K-dimensional vector: 
$\vec{w} = [P(z_1|w), \ldots, P(z_k|w), \ldots, P(z_K|w)]$. 
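
As a minimal sketch of how such a vector could be assembled, the Python snippet below derives
P(z_k|w) from per-topic word distributions by Bayes' rule. It assumes a uniform prior over topics,
which is an assumption of this sketch and is not stated above; the function name and the toy
probabilities are hypothetical.

```python
def basic_muptm_vector(word, per_topic_word_probs):
    """Return [P(z_1|w), ..., P(z_K|w)] for `word`.

    per_topic_word_probs: list of K dicts (one per topic) mapping word -> P(w|z_k)
    for the language of `word`. With a uniform topic prior (sketch assumption),
    P(z_k|w) is proportional to P(w|z_k), so we only need to normalize over topics.
    """
    scores = [topic.get(word, 0.0) for topic in per_topic_word_probs]
    total = sum(scores)
    if total == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every topic")
    return [score / total for score in scores]


# Toy per-topic word distributions (made-up numbers), one list per language.
phi_es = [{"reina": 0.8, "madre": 0.2}, {"reina": 0.2, "madre": 0.8}]
phi_en = [{"queen": 0.8, "mother": 0.2}, {"queen": 0.2, "mother": 0.8}]

vec_reina = basic_muptm_vector("reina", phi_es)
vec_queen = basic_muptm_vector("queen", phi_en)
```

Because both vectors live over the same K topics, a source word and a target word can be compared
dimension by dimension, which is what makes the topic space a shared bilingual space.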

We call this representation model (RM) Basic-MuPTM (BMu). Since the number of 
topics, that is, the number of vector dimensions K is typically high (Dinu & Lapata, 2010; 
Vulic et al., 2011), additional feature pruning (Reisinger & Mooney, 2010) may be employed 
in order to retain only the most descriptive dimensions in the MuPTM-based representation, 
which was shown to improve the performance on several semantic tasks (e.g., BLE or 
SWTC) (Vulic & Moens, 2013a; Vulic et al., 2015). 

A multilingual topic model is typically trained by Gibbs sampling (Geman & Geman, 
1984; Steyvers & Griffiths, 2007; Vulic et al., 2015). Similar to the SGNS/BWESG training 
procedure, Gibbs sampling for MuPTM/BiLDA also scans the training corpus word by 
word, and then cyclically updates topic assignments for each word token. However, unlike 
BWESG which uses only a subset of document-level training examples, Gibbs sampling for 
MuPTM uses all words from the source language document as well as all words from its 
coupled target language document to influence the topic assignment for the pivot word. The 
BWESG design relying on data dropout leads to decreased training times and computation 
costs to obtain final representations compared to Basic-MuPTM. 

Association-MuPTM Another representation is also based on the MuPTM framework; it 
contains association scores $P(w_a|w)$ for each $w, w_a \in V^S \cup V^T$ (Vulic & Moens, 2013a) as 
dimensions of real-valued word vectors. These association scores are computed as 
$P(w_a|w) = \sum_{k=1}^{K} P(w_a|z_k) P(z_k|w)$ (Griffiths, Steyvers, & Tenenbaum, 2007), and the word vector is a 
$(|V^S| + |V^T|)$-dimensional vector: $\vec{w} = [P(w_1^S|w), \ldots, P(w_{|V^S|}^S|w), P(w_1^T|w), \ldots, P(w_{|V^T|}^T|w)]$. 
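
The association score is a sum over topics and can be sketched in a few lines of Python. The
snippet is only an illustration under stated assumptions: the per-topic dictionaries merge the
source and target per-topic word distributions (each of which sums to one separately), and all
names and numbers are hypothetical.

```python
def association_vector(topic_probs_of_word, per_topic_word_probs, vocabulary):
    """Return [P(w_a|w) for w_a in vocabulary], with
    P(w_a|w) = sum_k P(w_a|z_k) * P(z_k|w).

    topic_probs_of_word: list [P(z_1|w), ..., P(z_K|w)] for the pivot word w.
    per_topic_word_probs: list of K dicts mapping word -> P(word|z_k).
    vocabulary: ordered list of words from both languages (the vector dimensions).
    """
    return [
        sum(p_z * topic.get(w_a, 0.0)
            for p_z, topic in zip(topic_probs_of_word, per_topic_word_probs))
        for w_a in vocabulary
    ]


# Toy example with K = 2 topics (made-up numbers).
phi = [
    {"reina": 0.8, "madre": 0.2, "queen": 0.8, "mother": 0.2},
    {"reina": 0.2, "madre": 0.8, "queen": 0.2, "mother": 0.8},
]
vocabulary = ["reina", "madre", "queen", "mother"]
p_z_given_reina = [0.8, 0.2]
assoc_reina = association_vector(p_z_given_reina, phi, vocabulary)
```

In this toy setting the entry for queen is larger than the entry for mother, so the association
vector for reina already points towards its translation.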

As with Basic-MuPTM, the original word representation may also be pruned post-hoc. 
We call this representation model Association-MuPTM (AMu). Since this approach 
relies on the MuPTM training plus additional $|V^S| \cdot |V^T|$ computations to estimate 
association scores, the cost of obtaining Association-MuPTM representations is even higher 
than for Basic-MuPTM, but it leads to more robust word representations for the BLE task 
(Vulic & Moens, 2013a). While both Basic-MuPTM and Association-MuPTM produce 
high-dimensional real-valued vectors with plenty of near-zero dimensions (the number of 
dimensions is typically measured in thousands) which have to be pruned afterwards with the 
pruning parameter often set ad-hoc, BWESG produces lower-dimensional dense real-valued 
vectors, and no additional post-hoc feature pruning is required for BWESG. 

Traditional-PPMI A traditional approach to building bilingual word representations in 
(cross-lingual) distributional semantics is to compute weighted co-occurrence scores (e.g., 
using PMI, TF-IDF) between pivot words and their context words in a window of predefined 
size, plus an external bilingual lexicon to align context words/dimensions across languages 
(Gaussier et al., 2004; Laroche & Langlais, 2010). A weighting function (WeF), which is a 
standard choice in distributional semantics and yields optimal or near-optimal results over 
a group of semantic tasks (Bullinaria & Levy, 2007), is the smoothed positive pointwise 
mutual information statistic (Pantel & Lin, 2002; Turney & Pantel, 2010). Furthermore, in 
order to induce context words without the need for a readily available lexicon, we employ 
the bootstrapping procedure from Peirsman and Pado (2011), Vulic and Moens (2013b). 
This representation model is called Traditional-PPMI (TPPMI). The word representation is 
an R-dimensional vector: $\vec{w} = [sc_1(w, c_1), \ldots, sc_k(w, c_k), \ldots, sc_R(w, c_R)]$. The dimensions 
of the vector space are R one-to-one word translation pairs $c_k = (c_k^S, c_k^T)$, and $sc_k(w, c_k)$ is 
the weighted co-occurrence score of the pivot word w and the k-th context feature, where 
one computes the co-occurrence score using $c_k^S$ if $w \in V^S$, or $c_k^T$ if $w \in V^T$. 
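
The weighting step can be illustrated with a short Python sketch. Note the assumptions: it computes
plain (unsmoothed) PPMI, whereas the text refers to a smoothed variant whose exact form is not given
here, and it takes window-based co-occurrence counts as already collected. Context features are
represented by the index k of the translation pair c_k; the function name and counts are hypothetical.

```python
import math
from collections import Counter


def ppmi_vectors(cooccurrence_counts):
    """Turn raw co-occurrence counts into sparse PPMI vectors.

    cooccurrence_counts: dict mapping (pivot_word, feature_index) -> positive count.
    For a source-language pivot the count refers to the source side of the
    translation pair, for a target-language pivot to its target side.
    Returns: dict pivot_word -> {feature_index: PPMI score}, positive scores only.
    """
    total = sum(cooccurrence_counts.values())
    pivot_totals, feature_totals = Counter(), Counter()
    for (pivot, feature), count in cooccurrence_counts.items():
        pivot_totals[pivot] += count
        feature_totals[feature] += count

    vectors = {}
    for (pivot, feature), count in cooccurrence_counts.items():
        # PMI = log( P(pivot, feature) / (P(pivot) * P(feature)) )
        pmi = math.log(count * total / (pivot_totals[pivot] * feature_totals[feature]))
        if pmi > 0.0:  # "positive" PMI: negative associations are dropped
            vectors.setdefault(pivot, {})[feature] = pmi
    return vectors


# Toy counts over R = 2 translation-pair dimensions (made-up numbers).
counts = {
    ("reina", 0): 8, ("reina", 1): 2,
    ("queen", 0): 8, ("queen", 1): 2,
    ("madre", 0): 2, ("madre", 1): 8,
}
vectors = ppmi_vectors(counts)
```

The resulting vectors are sparse: each pivot word keeps only the dimensions with which it
co-occurs more often than chance.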

Vector dimensions $c_k = (c_k^S, c_k^T)$ in the Traditional-PPMI representation and similar 
models with other WeFs are typically the most frequent and reliable translation pairs in 
the corpus. As opposed to BWESG, the obtained word vectors are again high-dimensional 
(typically thousands of dimensions) sparse real-valued vectors. In addition, Traditional-PPMI 
is a purely local distributional model deriving distributional context knowledge from 
narrow context windows (typically 3-10 surrounding words, e.g., Laroche & Langlais, 2010). 
A bootstrapping approach (Vulic & Moens, 2013b) which we use to induce the Traditional-PPMI 
representation starts from an automatically learned seed lexicon of one-to-one translation 
pairs obtained using some other model (e.g., Basic-MuPTM or Association-MuPTM), 
and then gradually detects new dimensions of the shared bilingual semantic space. We refer 
the interested reader to the relevant literature (Vulic & Moens, 2013b) for more details. 

4.2 Group II: BWE Induction Models Adjusted to Document-Aligned Data 

BiCVM Hermann and Blunsom (2014b) introduced a model called BiCVM (Bilingual 
Compositional Vector Model) that learns bilingual word embeddings from a sentence-aligned 
parallel corpus $C = \{s_1, s_2, \ldots, s_M\} = \{(s_1^S, s_1^T), (s_2^S, s_2^T), \ldots, (s_M^S, s_M^T)\}$, where $s_j = (s_j^S, s_j^T)$ 
now denotes a pair of aligned sentences. The model assumes that the aligned sentences 
have the same meaning, which implies that their sentence representations should be similar. 
Assume two functions f and g which map sentences given in the source and target 
language respectively to their semantic representations in $\mathbb{R}^d$, where d is again the 
representation dimensionality. The energy of the model given two sentences $(s_j^S, s_j^T) \in C$ is 
then defined as: $E(s_j^S, s_j^T) = \|f(s_j^S) - g(s_j^T)\|$. The goal is to minimize E for all 
semantically equivalent sentences (i.e., aligned sentences) in the corpus. In order to prevent 
the model from degenerating, they use a noise-contrastive large-margin update which 
ensures that the representations of non-aligned sentences observe a certain margin from 
each other. For every pair of parallel sentences $(s_j^S, s_j^T)$, they sample a number of additional 
negative sentence pairs $(s_j^S, n_{neg}^T)$ from the corpus (i.e., the sampled pairs are not 
observed as positive pairs in C). These noise samples are used in formulating the hinge 
loss as follows: $E_{hl}(s_j^S, s_j^T, n_{neg}^T) = \max(mrg + \Delta E(s_j^S, s_j^T, n_{neg}^T), 0)$, where mrg is the margin, and 
$\Delta E(s_j^S, s_j^T, n_{neg}^T) = E(s_j^S, s_j^T) - E(s_j^S, n_{neg}^T)$. The loss is minimized for every pair of parallel 
sentences in the corpus with L2-regularization on the model parameters. The number of 
noise samples per positive pair is a hyper-parameter of the model. A semantic signal 
is propagated from aligned sentences back to the individual words to obtain bilingual word 
embeddings. While the BiCVM model was originally built for sentence-aligned parallel 
data, exactly the same idea may be applied to document-aligned non-parallel data. In this 
paper, we test its ability to learn from noisier comparable data. The BWESG model is 
compared against BiCVM when inducing BWEs from both data types: comparable and parallel. 
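
The energy and the margin-based loss can be sketched as follows. This is an illustration rather
than the BiCVM implementation: it assumes additive composition for both f and g, sums the hinge
terms over the sampled noise sentences, and omits the L2 regularizer and the gradient updates.
All function names and toy embeddings are hypothetical.

```python
import math


def compose(sentence, embeddings):
    """Additive composition: the sentence vector is the sum of its word vectors."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in sentence:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return total


def energy(src_sentence, tgt_sentence, src_emb, tgt_emb):
    """E(s^S, s^T) = || f(s^S) - g(s^T) ||, with f and g additive here."""
    f = compose(src_sentence, src_emb)
    g = compose(tgt_sentence, tgt_emb)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def hinge_loss(src_sentence, tgt_sentence, noise_sentences, src_emb, tgt_emb, margin):
    """Sum over noise samples of max(margin + E(pos) - E(neg), 0)."""
    positive = energy(src_sentence, tgt_sentence, src_emb, tgt_emb)
    return sum(
        max(margin + positive - energy(src_sentence, noise, src_emb, tgt_emb), 0.0)
        for noise in noise_sentences
    )


# Toy embeddings in a 2-dimensional shared space (made-up numbers).
src_emb = {"reina": [1.0, 0.0], "madre": [0.0, 1.0]}
tgt_emb = {"queen": [1.0, 0.0], "mother": [0.0, 1.0]}

loss_small_margin = hinge_loss(["reina"], ["queen"], [["mother"]], src_emb, tgt_emb, 1.0)
loss_large_margin = hinge_loss(["reina"], ["queen"], [["mother"]], src_emb, tgt_emb, 2.0)
```

With a margin of 1 the noise sentence is already far enough from the source sentence, so the loss
is zero; with a margin of 2 the loss becomes positive and would push the noise pair further apart.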

5. A very similar (but more expensive) model which also learns from parallel sentence-aligned data was 
also introduced by Chandar et al. (2014). 

Mikolov Another collection of BWE induction models (Mikolov et al., 2013b; Faruqui 
& Dyer, 2014; Dinu, Lazaridou, & Baroni, 2015; Lazaridou, Dinu, & Baroni, 2015) assumes 
the following setup: first, two monolingual embedding spaces, $\mathbb{R}^{dim_S}$ and $\mathbb{R}^{dim_T}$, are 
induced separately in each of the two languages using a standard monolingual WE model 
such as SGNS (Mikolov et al., 2013a, 2013c). $dim_S$ and $dim_T$ denote the dimensionality of 
monolingual embedding spaces in the source and target language respectively. The bilingual 
signal is provided in the form of word translation pairs $(x_i, y_i)$, where $x_i \in V^S$, $y_i \in V^T$, and 
$\vec{x}_i \in \mathbb{R}^{dim_S}$, $\vec{y}_i \in \mathbb{R}^{dim_T}$. Training is cast as a multivariate regression problem: it implies 
learning a function that maps the source language vectors from the training data to their 
corresponding target language vectors. A standard approach (Mikolov et al., 2013b; Dinu 
et al., 2015) is to assume a linear map $\mathbf{W} \in \mathbb{R}^{dim_S \times dim_T}$, where an L2-regularized 
least-squares error objective (i.e., ridge regression) is used to learn the map W: it is learned by 
solving the following optimization problem (typically by stochastic gradient descent): 

$$\min_{\mathbf{W} \in \mathbb{R}^{dim_S \times dim_T}} \|\mathbf{X}\mathbf{W} - \mathbf{Y}\|^2 + \lambda \|\mathbf{W}\|^2$$

X and Y are matrices obtained through the respective concatenation of source language 
and target language vectors from training pairs. Once the linear map W is estimated, any 
previously unseen source language word vector $\vec{x}_u$ may be straightforwardly mapped into 
the target language embedding space as $\mathbf{W}\vec{x}_u$. After mapping all vectors $\vec{x}$, $x \in V^S$, 
the target embedding space serves as a bilingual embedding space (Figure 1). 
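
A small dependency-free Python sketch of learning such a map is given below. It is an illustration
under stated assumptions: it uses full-batch gradient descent instead of stochastic updates, treats
the norms as squared Frobenius norms, and stores word vectors as rows of X and Y, so a source
vector is mapped as a row vector times W. Function names, learning rate and step count are
hypothetical choices.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fit_linear_map(X, Y, lam=0.0, lr=0.1, steps=500):
    """Minimize ||XW - Y||^2 + lam * ||W||^2 by gradient descent.

    X: n x dim_S source vectors (rows), Y: n x dim_T target vectors (rows).
    Returns W as a dim_S x dim_T list of rows.
    """
    dim_s, dim_t = len(X[0]), len(Y[0])
    W = [[0.0] * dim_t for _ in range(dim_s)]
    X_t = [list(col) for col in zip(*X)]
    for _ in range(steps):
        predictions = matmul(X, W)
        residual = [[p - y for p, y in zip(p_row, y_row)]
                    for p_row, y_row in zip(predictions, Y)]
        grad = matmul(X_t, residual)  # X^T (XW - Y); the factor 2 is applied below
        W = [[w - lr * 2.0 * (g + lam * w) for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
    return W


def map_vector(x, W):
    """Map one source-language row vector into the target space."""
    return matmul([x], W)[0]


# Toy seed lexicon with two translation pairs (made-up 2-dimensional vectors).
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[2.0, 0.0], [0.0, 3.0]]
W_plain = fit_linear_map(X, Y, lam=0.0)
W_ridge = fit_linear_map(X, Y, lam=1.0)
mapped = map_vector([1.0, 1.0], W_plain)
```

The regularized solution shrinks the map towards zero, which is the intended effect of the
penalty term when the seed lexicon is small or noisy.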

Although the main strength of the model is its ability to learn embeddings on larger 
monolingual training sets, the model may also be adjusted to the setting where the only 
training data are document-aligned comparable data as follows: (1) Automatically learn 
a seed lexicon of reliable one-to-one translation pairs from document-aligned data using 
a bootstrapping approach from Peirsman and Pado (2010), Vulic and Moens (2013b), (2) 
Train two separate monolingual embedding spaces on two separated halves of the 
document-aligned data set (i.e., using only source language documents and only target language 
documents), (3) Learn the mapping between the two spaces using the pairs from Step 1. 

BilBOWA Another collection of BWE induction models jointly optimizes two monolingual 
objectives, with the cross-lingual objective acting as a cross-lingual regularizer during 
training (Klementiev et al., 2012; Gouws et al., 2015; Soyer, Stenetorp, & Aizawa, 2015). 
The idea behind joint training may be summarized by the simplified formulation (Luong, 
Pham, & Manning, 2015): $\gamma(Mono_S + Mono_T) + \delta Bi$. 

The monolingual objectives $Mono_S$ and $Mono_T$ ensure that similar words in each 
language are assigned similar embeddings and aim to capture the semantic structure of each 
language, whereas the cross-lingual objective Bi ensures that similar words across languages 
are assigned similar embeddings, and ties the two monolingual spaces together into a 
bilingual space. Parameters $\gamma$ and $\delta$ govern the influence of the monolingual and bilingual 
components (see Footnote 6). The bilingual signal for these models, now acting as the cross-lingual regularizer 
during the joint training, is provided in sentence-aligned parallel data. Although they use 
the same data sources, the models differ in the choice of monolingual and cross-lingual 
objectives. In this work, we opt for the BilBOWA model from Gouws et al. (2015) as the 
representative model to be included in the comparisons, due to its previous solid performance 
and robustness in the BLE task, its reduced complexity reflected in fast computations on 
massive datasets, as well as its public availability: https://github.com/gouwsmeister/bilbowa. 
In short, the BilBOWA model combines SGNS for the monolingual objectives together 
with the cross-lingual objective that minimizes the L2-loss between the bag-of-word vectors 
of parallel sentences. For more details about the exact training procedure, we refer the 
interested reader to the work from Gouws et al. (2015). 

Again, although the main strength of the model is its ability to learn embeddings 
on larger monolingual training sets, the model may also be adjusted to the setting with 
document- or sentence-aligned data by: (1) using two halves of the aligned corpus for 
separate monolingual training, (2) using the alignment signal for bilingual training. 

6. Setting $\gamma = 0$ reduces the model to the setting similar to BiCVM (Hermann & Blunsom, 2014b). $\gamma = 1$ 
results in the models from Klementiev et al. (2012), Gouws et al. (2015), and Soyer et al. (2015). 

5. From Word Representations to Semantic Word Similarity 

Assume now that we have induced bilingual word representations, regardless of the chosen 
RM. Given two words $w_i$ and $w_j$, irrespective of their actual language, we may compute the 
degree of their semantic similarity by applying a similarity function (SF) on their vector 
representations $\vec{w}_i$ and $\vec{w}_j$: $sim(w_i, w_j) = SF(\vec{w}_i, \vec{w}_j)$. Different choices (or rather families 
of) SFs are cosine, the Kullback-Leibler or the Jensen-Shannon divergence, the Hellinger 
distance, the Jaccard index, etc. (Lee, 1999; Cha, 2007), and different RMs typically require 
different SFs to produce optimal or near-optimal results over various semantic tasks. When 
working with word embeddings, a standard choice for SF is cosine similarity (cos) (Mikolov 
et al., 2013c), which is also a typical choice in traditional distributional models (Bullinaria 
& Levy, 2007). The similarity is then computed as follows: 

$$sim(w_i, w_j) = \cos(\vec{w}_i, \vec{w}_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{|\vec{w}_i| \cdot |\vec{w}_j|} \qquad (4)$$

On the other hand, a good choice for SF when working with probabilistic RMs such as 
Basic-MuPTM and Association-MuPTM is the Hellinger distance (Pollard, 2001; Cha, 
2007; Kazama, Saeger, Kuroda, Murata, & Torisawa, 2010), which displays excellent results 
in the BLE task (Vulic & Moens, 2013a). The similarity between words $w_i$ and $w_j$ using 
the Hellinger distance is computed as follows: 

$$sim(w_i, w_j) = \sqrt{\frac{1}{2} \sum_{k=1}^{K} \left( \sqrt{P(f_k|w_i)} - \sqrt{P(f_k|w_j)} \right)^2} \qquad (5)$$

Note that the Hellinger distance is applicable only if word representations are probability 
distributions, which is the case for Basic-MuPTM and Association-MuPTM. $P(f_k|w_i)$ 
denotes the probability score for the k-th dimension ($f_k$) in the vector representation with 
Basic-MuPTM or Association-MuPTM (see Footnote 7). 
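
For illustration, the Hellinger distance between two probability vectors can be computed as in
the Python sketch below. The normalization by 1/2 inside the square root (which bounds the distance
by 1) is the standard textbook form and is an assumption of this sketch; the function name is
hypothetical. Since this is a distance, smaller values indicate more similar words, so ranked lists
are sorted in ascending order.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    p, q: sequences of non-negative scores of equal length, each summing to 1.
    Returns a value in [0, 1]; 0 means identical distributions.
    """
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


# Toy topic-based vectors [P(z_1|w), P(z_2|w)] (made-up numbers).
reina = [0.8, 0.2]
queen = [0.8, 0.2]
mother = [0.2, 0.8]

candidates = {"queen": queen, "mother": mother}
ranked = sorted(candidates, key=lambda w: hellinger(reina, candidates[w]))
```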

For each word $w_i$, we can build a ranked list $RL(w_i)$ which consists of all other words $w_j$ 
ranked according to their respective semantic similarity scores $sim(w_i, w_j)$. Additionally, 
we label the ranked list $RL(w_i)$ that is pruned at position M as $RL_M(w_i)$. Since we may 
retain language labels for words when training in multilingual settings (e.g., language labels 
are marked by different colors in Figure 2), we may compute: (1) monolingual similarity, 
e.g., given $w_i \in V^S$, we retain only $w_j \in V^S$ in the ranked list (analogous for $w_i \in V^T$), 
(2) cross-lingual similarity (CLSS), e.g., given $w_i \in V^S$, we retain only $w_j \in V^T$, and (3) 
multilingual similarity, where we retain all words $w_j \in V^S \cup V^T$. When computing CLSS 
for $w_i$, the most similar word cross-lingually is called the cross-lingual nearest neighbor. 
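
The three similarity modes differ only in which language labels are kept in the ranked list. The
following Python sketch shows one way to express this with cosine similarity; the function names,
the language labels "ES"/"EN" and the toy embeddings are hypothetical and serve only as an example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranked_list(query, embeddings, languages, keep=None, top_m=None):
    """Rank all other words by similarity to `query`.

    embeddings: dict word -> vector in the shared bilingual space.
    languages: dict word -> language label.
    keep: set of language labels to retain (None keeps all: multilingual similarity).
    top_m: prune the list at position M (None returns the full list).
    """
    candidates = [w for w in embeddings
                  if w != query and (keep is None or languages[w] in keep)]
    ranked = sorted(candidates,
                    key=lambda w: cosine(embeddings[query], embeddings[w]),
                    reverse=True)
    return ranked if top_m is None else ranked[:top_m]


# Toy shared embedding space (made-up 2-dimensional vectors).
embeddings = {
    "reina": [1.0, 0.1], "rey": [0.9, 0.3], "madre": [0.12, 1.0],
    "queen": [1.0, 0.12], "mother": [0.1, 1.0],
}
languages = {"reina": "ES", "rey": "ES", "madre": "ES", "queen": "EN", "mother": "EN"}

monolingual = ranked_list("reina", embeddings, languages, keep={"ES"})
cross_lingual = ranked_list("reina", embeddings, languages, keep={"EN"})
multilingual = ranked_list("reina", embeddings, languages)
nearest_neighbor = ranked_list("reina", embeddings, languages, keep={"EN"}, top_m=1)[0]
```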

We will employ the models of context-insensitive CLSS at the word type level to ex¬ 
tract bilingual lexicons from document-aligned or sentence-aligned data, and to compare all 
representation models in the BLE task in Section 7. 


5.1 Context Sensitive Models of (Cross-Lingual) Semantic Similarity 

The context-insensitive models of semantic similarity provide ranked lists of semantically 
similar words invariably or in isolation, and they operate at the level of word types. They 
do not explicitly encode different word senses. In practice, it means that, given a sentence 
“The coach of his team was not satisfied with the game yesterday.”, these context-insensitive 
CLSS models are not able to detect that the Spanish word entrenador is more similar to 
the polysemous English word coach in the context of this sentence than the Spanish word 
autocar, although autocar is listed as the most semantically similar word to coach 
globally/invariably without any observed context. In another example, while the Spanish words 
partido, encuentro, cerilla or correspondencia are all highly similar to another ambiguous 
English word match when observed in isolation, given the sentence “She was 
unable to find a match in her pocket to light up a cigarette.”, it is clear that the strength of 
cross-lingual semantic similarity should change in context as only cerilla exhibits a strong 
cross-lingual semantic similarity to match within this particular sentential context. 

7. Prior work has shown that the results for Basic-MuPTM and Association-MuPTM are slightly higher 
when cosine is replaced with the Hellinger distance. Therefore, in this particular case we have opted for 
the Hellinger distance to report a more competitive baseline. 

The goal now is to build BWE-based models of cross-lingual semantic similarity in 
context, similar to context-aware CLSS models proposed by Vulic and Moens (2014). Two 
key questions are: (i) How to provide BWE-based representations beyond word level to 
represent the context of a word token?; (ii) How to use the contextual knowledge in a 
context-sensitive model of semantic similarity? 

Following Vulic and Moens (2014), given a word token w in context (e.g., a window of 
words, a sentence, a paragraph, or a document), we build its context set or rather context 
bag $Con(w) = \{cw_1, \ldots, cw_r\}$ by harvesting r neighboring words in the chosen context scope 
(e.g., the context bag may comprise all content-bearing words in the same sentence as the 
pivot word token, the so-called sentential context). In order to represent the context $Con(w)$ 
in the d-dimensional embedding space, we need to apply a model of semantic composition 
to learn its d-dimensional vector representation $\overrightarrow{Con(w)}$. 

Formally, given word w, we may specify the vector representation of the context bag 
$Con(w)$ as the d-dimensional vector/embedding: 

$$\overrightarrow{Con(w)} = \vec{cw}_1 \star \vec{cw}_2 \star \ldots \star \vec{cw}_r \qquad (6)$$

where $\vec{cw}_1, \ldots, \vec{cw}_r$ are d-dimensional WEs learned from the data, and $\star$ is a compositional 
vector operator such as addition, point-wise multiplication, tensor product, etc. 

A plethora of models for semantic composition have been proposed in the relevant 
literature, differing in their choice of vector operators, input structures and required knowledge 
(Mitchell & Lapata, 2008; Baroni & Zamparelli, 2010; Rudolph & Giesbrecht, 2010; Socher, 
Huval, Manning, & Ng, 2012; Blacoe & Lapata, 2012; Clarke, 2012; Hermann & Blunsom, 
2014b; Milajevs, Kartsaklis, Sadrzadeh, & Purver, 2014), to name only a few. In this work, 
driven by the observed linear linguistic regularities in the embedding spaces (Mikolov et al., 
2013d), we opt for simple addition (denoted by +) from Mitchell and Lapata (2008) as 
the compositional operator, due to its simplicity, the ease of applicability on bag-of-words 
contexts, and its relatively solid performance in various compositional tasks (Mitchell & 
Lapata, 2008; Milajevs et al., 2014). The d-dimensional embedding $\overrightarrow{Con(w)}$ is then: 

$$\overrightarrow{Con(w)} = \vec{cw}_1 + \vec{cw}_2 + \ldots + \vec{cw}_r \qquad (7)$$

If we use any BWE-based RM, we may compute the context-sensitive semantic similarity 
score $sim(w_i, t_j, Con(w_i))$ between $t_j$ and $w_i$ given its context $Con(w_i)$ in the shared 
bilingual embedding space as follows: 

$$sim(w_i, t_j, Con(w_i)) = SF(\vec{w}_i', \vec{t}_j) \qquad (8)$$

$t_j \in V^T$ is any target language word, and $\vec{t}_j$ its word representation, while $\vec{w}_i'$ is the new 
“contextualized” vector representation for $w_i$ modulated by its context $Con(w_i)$, that is, 
its context-aware representation. Vulic and Moens (2014) introduced a linear interpolation 
of two d-dimensional vectors as a plausible solution for the modulation/contextualization. 
The modulation of representation for $w_i$ is computed as follows: 

$$\vec{w}_i' = (1 - \lambda) \cdot \vec{w}_i + \lambda \cdot \overrightarrow{Con(w_i)} \qquad (9)$$

where $\vec{w}_i$ is the word embedding for $w_i$ computed at the word type level, $\overrightarrow{Con(w_i)}$ is the 
embedding for the context bag computed using eq. (7), and $\lambda$ is an interpolation parameter. 
Another set of similar models that can yield context-sensitive similarity computations has 
been proposed very recently, and has displayed very competitive results despite its 
simplicity (Melamud, Levy, & Dagan, 2015). Here, we present the two best scoring 
context-sensitive models, which we adapt to the bilingual setting: 

Add-Melamud: $sim(w_i, t_j, Con(w_i)) = \frac{SF(\vec{w}_i, \vec{t}_j) + \sum_{cw_i \in Con(w_i)} SF(\vec{cw}_i, \vec{t}_j)}{|Con(w_i)| + 1}$ 

Mult-Melamud: $sim(w_i, t_j, Con(w_i)) = \sqrt[|Con(w_i)| + 1]{SF(\vec{w}_i, \vec{t}_j) \cdot \prod_{cw_i \in Con(w_i)} SF(\vec{cw}_i, \vec{t}_j)}$ 

Note that for the Mult model one has to avoid negative values, so a simple shift to an 
all-positives interval is required, e.g., the shifted cosine score becomes $\cos'(x, y) = \frac{\cos(x, y) + 1}{2}$. 
Unlike the models from Vulic and Moens (2014), these two models do not aggregate single 
word representations into one vector that represents the context, but compute similarity 
scores separately with each word from the context. For more details regarding the models, 
we refer the interested reader to the original paper (Melamud et al., 2015). 
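
The three context-sensitive scoring schemes discussed above can be contrasted in a compact Python
sketch. It is illustrative only: cosine is used as SF throughout, the shifted cosine (cos + 1) / 2
is an assumed choice for the all-positives interval, and the function names and 2-dimensional toy
vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def shifted_cosine(u, v):
    """Cosine shifted to [0, 1] so that products and roots stay well defined."""
    return (cosine(u, v) + 1.0) / 2.0


def interpolated_similarity(w_vec, context_vecs, t_vec, lam):
    """Additive context vector plus linear interpolation with the word vector."""
    context = [sum(column) for column in zip(*context_vecs)]
    contextualized = [(1.0 - lam) * a + lam * b for a, b in zip(w_vec, context)]
    return cosine(contextualized, t_vec)


def add_melamud(w_vec, context_vecs, t_vec):
    """Arithmetic mean of the similarities of t to w and to every context word."""
    scores = [cosine(w_vec, t_vec)] + [cosine(c, t_vec) for c in context_vecs]
    return sum(scores) / len(scores)


def mult_melamud(w_vec, context_vecs, t_vec):
    """Geometric mean of the shifted similarities of t to w and to every context word."""
    product = shifted_cosine(w_vec, t_vec)
    for c in context_vecs:
        product *= shifted_cosine(c, t_vec)
    return product ** (1.0 / (len(context_vecs) + 1))


# Toy space: dimension 0 ~ "sports", dimension 1 ~ "vehicles" (made-up numbers).
coach = [0.6, 0.8]                       # ambiguous, globally closer to "autocar"
entrenador, autocar = [1.0, 0.0], [0.0, 1.0]
context = [[1.0, 0.0], [0.9, 0.1]]       # e.g., vectors for "team" and "game"

scores = {
    name: (fn(coach, context, entrenador), fn(coach, context, autocar))
    for name, fn in [
        ("interpolated", lambda w, c, t: interpolated_similarity(w, c, t, 0.5)),
        ("add", add_melamud),
        ("mult", mult_melamud),
    ]
}
```

In this toy example the type-level similarity prefers autocar, whereas all three context-sensitive
scores prefer entrenador once the sports-related context words are taken into account.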

We will employ the models of context-sensitive CLSS at the word token level to compare 
all representation models in the task of suggesting word translations in context in Section 8. 


6. Training Setup 

Training Data. To induce bilingual word embeddings as well as to be directly comparable 
with baseline representations from prior work, we use a dataset comprising a subset of 
comparable Wikipedia data available in three language pairs (Vulic & Moens, 2013b, 2014) 
(see Footnote 8): 
(i) a collection of 13,696 Spanish-English Wikipedia article pairs (ES-EN), (ii) a collection 
of 18,898 Italian-English Wikipedia article pairs (IT-EN), and (iii) a collection of 7,612 
Dutch-English Wikipedia article pairs (NL-EN). All corpora are theme-aligned comparable 
corpora, that is, the aligned document pairs discuss similar themes, but are in general 
not direct translations of each other. To be directly comparable to prior work in the two 
evaluation tasks (Vulic & Moens, 2013b, 2014), we retain only nouns that occur at least 
5 times in the corpus. Lemmatized word forms are recorded when available, and original 
forms otherwise. TreeTagger (Schmid, 1994) is used for POS tagging and lemmatization. 
After the preprocessing steps, vocabularies comprise between 7,000 and 13,000 noun types 
for each language in each language pair, and the training corpora are quite small, ranging 
from approximately 1.5M tokens for NL-EN to 4M for ES-EN. Exactly the same training 
data and vocabularies are used to train all representation models in comparison (both from 
Group I and Group II, see Section 4). 

8. Available online: people.cs.kuleuven.be/~ivan.vulic/software/ 

Corpus:                     Wikipedia               Europarl
Pair:                       ES-EN   IT-EN   NL-EN   ES-EN   IT-EN   NL-EN
Average length (OTHER)      111     84      51      29      29      27
Average length (EN)         174     154     129     28      29      27
Average length difference   127     125     102     3       4       4

Table 1: Training data statistics: Non-parallel document-aligned Wikipedia vs parallel 
sentence-aligned Europarl for all three language pairs. OTHER = ES, IT or NL. 
Lengths are measured in word tokens. Averages are rounded to the closest integer. 

We also demonstrate that it is simple and straightforward to train BWESG on 
parallel sentence-aligned data using the same modeling principles. For that purpose, we use 
Europarl.v7 (Koehn, 2005) for all three language pairs obtained from the OPUS website 
(Tiedemann, 2012) (see Footnote 9). As the only preprocessing step, we retain only words occurring at 
least 5 times in the corpus. Each corpus contains approximately 2M parallel sentences, 
vocabularies are an order of magnitude larger than those of the smaller Wikipedia data 
(i.e., varying from 45K EN word types to 75K NL word types), and the corpora sizes are 
approximately 120M tokens. Data statistics of the two data sources, Wikipedia vs Europarl, 
are provided in Table 1. The statistics reveal the different nature of the two corpora, with 
significantly more variance and noise reported for the Wikipedia data. 

Trained BWESG Models To test the effect of random shuffling in the merge and shuffle 
BWESG strategy, we have trained the BWESG model with 10 random corpora shuffles for 
all three training corpora. We also train BWESG with the length-ratio shuffle strategy. All 
parameters are set to default suggested parameters for SGNS from the word2vec package: 
stochastic gradient descent (SGD) with a linearly decreasing global learning rate of 0.025, 
25 negative samples, subsampling rate 1e-4, and 15 epochs. 

We have varied the number of dimensions d = 100, 200, 300. We have also trained 
BWESG with d = 40 to be directly comparable to readily available sets of BWEs from 
prior work (Chandar et al., 2014). Moreover, to test the effect of window size on the final 
results, i.e., the number of positives used for training, we have varied the maximum window 
size cs from 4 to 60 in steps of 4 (see Footnote 10). 

We will make our pre-training and training code for BWESG publicly available, along 
with all BWESG-based bilingual word embeddings for the three language pairs at: 
http://liir.cs.kuleuven.be/software.php. 

Baseline Representations: Group I All parameters of the baseline representation 
models (i.e., topic models and their settings, the number of dimensions K, the values for 
feature pruning, window size, weighting and similarity functions) were optimized in prior 
work. Therefore, the settings are adopted directly from previous work (Griffiths et al., 
2007; Bullinaria & Levy, 2007; Dinu & Lapata, 2010; Vulic & Moens, 2013a, 2013b; Kiela 
& Clark, 2014), and we encourage the interested reader to check the details and exact 
parameter setup in the relevant literature. We provide only a short overview here. 

9. http://opus.lingfil.uu.se/ 

10. We remind the reader that we slightly abuse terminology here, as the BWESG windows do not include 
the locality component any more. 

For Basic-MuPTM and Association-MuPTM, as in Vulic and Moens (2013a), a bilingual 
latent Dirichlet allocation (BiLDA) model was trained with K = 2000 topics and the 
standard values for hyper-parameters: $\alpha = 50/K$, $\beta = 0.01$ (Steyvers & Griffiths, 2007). 
Post-hoc semantic space pruning was employed with the pruning parameter set to 200 for 
Basic-MuPTM and to 2000 for Association-MuPTM. We refer the reader to the relevant 
paper for more details. 

For Traditional-PPMI, as in Vulic and Moens (2013b), a seed lexicon was automatically 
obtained by bootstrapping from the initial seed lexicon of reliable pairs stemming from the 
Association-MuPTM model (with the same parameters for Association-MuPTM as listed 
above). The window size was fixed to 6 in both directions. We again refer the reader to the 
paper for more details. 

Baseline Representations: Group II All baseline BWE models were trained with the 
same number of dimensions as BWESG: d = 100, 200, 300. Other model-specific parameters 
were taken as suggested in prior work. 

For BiCVM, we use the tool released by the authors (see Footnote 11). We train an additive model, with 
hinge loss margin mrg = d as in the original paper, batch size of 50, and noise parameter 
of 10. All models were trained with 200 iterations. 

For Mikolov, we train two monolingual SGNS models using the original word2vec 
package, SGD with a global learning rate of 0.025, 25 negative samples, subsampling rate 
1e-4, and 15 epochs. The seed lexicon required to learn the mapping between two 
monolingual spaces is exactly the same as for Traditional-PPMI. 

For BilBOWA, we use SGD with a global learning rate 0.15 for training (see Footnote 12), 25 negative 
samples, subsampling rate 1e-4, and 15 epochs. For BilBOWA and Mikolov, we vary 
the window size the same way as in BWESG. 

Similarity Functions Unless stated otherwise, the similarity function used in all 
similarity computations with all RMs is cosine (cos). The only exceptions are Basic-MuPTM 
and Association-MuPTM where the Hellinger distance (HD) was used since it consistently 
outperformed cosine for these two RM types in prior work (see Footnote 7). 

A Roadmap to Experiments In the first experiment, we quickly visually inspect the 
obtained lists of semantically similar words using the BWESG bilingual representation model. 
Following that, we compare BWESG-based models for bilingual lexicon extraction (BLE) 
and suggesting word translations in context (SWTC) against both groups of baseline 
models discussed in Section 4. The experiments and results for the BLE task are presented in 
Section 7, while the experiments and results for SWTC are presented in Section 8. 


11. https://github.com/karlmoritz/bicvm 

12. Suggestions for parameter values received through personal correspondence with the authors. The 
software is available online: https://github.com/gouwsmeister/bilbowa 

Spanish-English (ES-EN)                   Italian-English (IT-EN)                   Dutch-English (NL-EN)
(1)           (2)           (3)           (1)           (2)           (3)           (1)           (2)           (3)
reina         reina         reina         madre         madre         madre         schilder      schilder      schilder
(Spanish)     (English)     (Combined)    (Italian)     (English)     (Combined)    (Dutch)       (English)     (Combined)

rey           queen (+)     queen (+)     padre         mother (+)    mother (+)    kunstschilder painter (+)   painter (+)
trono         heir          rey           moglie        father        padre         schilderij    painting      kunstschilder
monarca       throne        trono         sorella       sister        moglie        kunstenaar    portrait      painting
heredero      king          heir          figlia        wife          father        olieverf      artist        schilderij
matrimonio    royal         throne        figlio        daughter      sorella       portret       canvas        kunstenaar
hijo          reign         monarca       fratello      son           figlia        schilderen    brush         portrait
reino         succession    heredero      casa          friend        figlio        frans         cubism        olieverf
reinado       princess      king          amico         childhood     sister        nederlands    art           portret
regencia      marriage      matrimonio    marito        family        fratello      componist     poet          schilderen
duque         prince        royal         donna         cousin        wife          beeldhouwer   drawing       artist

Table 2: Example lists of top 10 semantically similar words for all 3 language pairs 
obtained using BWESG (length-ratio shuffle); d = 200, cs = 48; (1) only source 
language words (ES/IT/NL) are listed while target language words are skipped 
(monolingual similarity); (2) only target language words (EN) are listed 
(cross-lingual similarity); (3) words from both languages are listed (multilingual 
similarity). The correct one-to-one translation is marked by (+). 


7. Evaluation Task I: Bilingual Lexicon Extraction 

7.1 Task Description 

One may employ the context-insensitive CLSS models from Section 5 to extract bilingual 
lexicons automatically from data. By harvesting cross-lingual nearest neighbors, one is able 
to build a bilingual lexicon of one-to-one translation pairs $(w_i^S, w_j^T)$. We test the 
validity of our BWEs and baseline representations in the BLE task. 

7.2 Experimental Setup 

Test Data For each language pair, we evaluate on the standard 1,000 ground truth one-to-one 
translation pairs built for the three language pairs (ES/IT/NL-EN) by Vulic and Moens 
(2013a, 2013b). Translation direction is ES/IT/NL → EN. The data is available online.¹³ 

Evaluation Metrics Since we can build a one-to-one bilingual lexicon by harvesting 
one-to-one translation pairs, the lexicon quality is best reflected in the Acc_1 score, that is, 
the number of source language (ES/IT/NL) words $w_i^S$ from ground truth translation pairs 
for which the top ranked word cross-lingually is the correct translation in the other language 
(EN) according to the ground truth, divided by the total number of ground truth translation 
pairs (=1000) (Gaussier et al., 2004; Tamura et al., 2012; Vulic & Moens, 2013b). Similar trends 
are observed within a more lenient setting with Acc_5 and Acc_10 scores, but we omit these 
results for clarity and because the actual BLE performance is best reflected in Acc_1. 
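
To make the metric concrete, here is a minimal sketch of BLE by cross-lingual nearest-neighbour search followed by the Acc_1 computation. It assumes cosine similarity over a shared embedding space; the three-dimensional vectors, the tiny vocabularies and the function names are hypothetical and serve only as an illustration, not as the evaluation code used in the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top1_translation(src_vec, tgt_vecs):
    """Return the target word whose vector is most similar to src_vec."""
    return max(tgt_vecs, key=lambda word: cosine(src_vec, tgt_vecs[word]))


def acc1(gold, src_vecs, tgt_vecs):
    """Fraction of gold source words whose top-ranked target word is the gold translation."""
    correct = sum(
        1 for src, tgt in gold.items()
        if top1_translation(src_vecs[src], tgt_vecs) == tgt
    )
    return correct / len(gold)


# Hypothetical toy embeddings in a shared bilingual space.
src_vecs = {
    "reina": [0.9, 0.1, 0.0],
    "cebolla": [0.0, 0.2, 0.9],
    "golfo": [0.1, 0.9, 0.1],
}
tgt_vecs = {
    "queen": [0.8, 0.2, 0.1],
    "onion": [0.1, 0.1, 0.8],
    "coast": [0.0, 1.0, 0.1],
    "gulf": [0.3, 0.8, 0.3],
}
gold = {"reina": "queen", "cebolla": "onion", "golfo": "gulf"}

print(acc1(gold, src_vecs, tgt_vecs))  # two of the three toy pairs are retrieved correctly
```

In this toy space "golfo" is mapped to the related word "coast" rather than to "gulf", which is exactly the kind of near miss that Acc_1 counts as an error and that the more lenient Acc_5 and Acc_10 scores would tolerate.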


13. http://people.cs.kuleuven.be/~ivan.vulic/software/ 


Spanish-English (ES-EN)                         Italian-English (IT-EN)
BWESG       BMu         AMu         TPPMI       BWESG       BMu         AMu         TPPMI
cebolla     cebolla     cebolla     cebolla     golfo       golfo       golfo       golfo
-----------------------------------------------------------------------------------------------
onion(+)    dessert     dessert     sauce       gulf(+)     whale       coast       coast
dish        salad       walnut      cheese      coast       dolphin     isthmus     sea
marinade    nut         salad       garlic      coastline   coast       coastline   island
cuisine     walnut      nut         salad       bay         suborder    fjord       bay
soup        rice        hazelnut    chili       island      cadmium     ferry       lagoon
sauce       toast       porridge    onion(+)    peninsula   ferry       monsoon     harbour
cheese      porridge    rice        cuisine     settlement  monsoon     mainland    beach
coriander   paddy       marinade    flavor      shore       fjord       seaside     shore
vegetable   tuber       toast       bread       tourism     isthmus     isle        river
tortilla    potato      paddy       dish        ferry       mainland    suborder    lake


Table 3: Example lists of top 10 semantically similar words for ES-EN and IT-EN, obtained 
using BWESG (length-ratio, d = 200, cs = 48), and the three representation 
models from Group I. The correct translation is marked by (+). 


7.3 Results and Discussion 

7.3.1 Experiment 0: Qualitative Analysis and Comparison 

Table 2 displays the top 10 semantically similar words monolingually, across languages and 
combined/multilingually for one ES, one IT and one NL word. BWESG is able to find semantically 
coherent lists of words for all three directions of similarity (i.e., monolingual, cross-lingual, 
multilingual). In the combined (multilingual) ranked lists, words from both languages are 
represented as top similar words. This initial qualitative analysis already demonstrates 
the ability of BWESG to induce a shared bilingual embedding space using only document 
alignments as bilingual signals. 

In another brief analysis, we qualitatively compare the cross-lingual ranked lists acquired 
by BWESG with the other three baseline CLSS/BLE models from Group I. The lists for one 
ES word and one IT word are presented in Table 3. For the two example words, BWESG is 
the only model which is able to rank the actual correct translations as nearest cross-lingual 
neighbors. It is already symptomatic that the word gulf, which is the correct translation 
for golfo, does not occur in the ranked list RL_10(golfo) at all in case of the three baseline 
models. We will soon quantitatively confirm this initial suspicion, and demonstrate that 
BWESG is superior to the three baseline models in the BLE task. 

As an aside, Table 3 also clearly reveals the difficulty of judging the quality of models 
for computing semantic similarity/relatedness solely based on the observed output of the 
models. The lists RL_10(cebolla) and RL_10(golfo) appear significantly different across all 
four models, yet all these lists contain words which appear semantically related to the source 
word. Therefore, we require a more systematic quantitative task-oriented comparison of 
induced word representations. 


14. We also conducted a small experiment on solving word analogies using monolingual English 
embedding spaces, and then we repeated the experiment with the same vocabulary and bilingual 
English-Spanish/Italian/Dutch embedding spaces. The results follow the findings from (Faruqui & 
Dyer, 2014), where only slight (and often insignificant) fluctuations for SGNS vectors were 
reported (e.g., the fluctuations are < 1% on average in our experiments) when moving from 
monolingual to bilingual embedding spaces. We may conclude that the linguistic regularities 
(Mikolov et al., 2013d) established for monolingual embedding spaces induced by SGNS also hold 
in bilingual embedding spaces induced by BWESG. 


Pair:                 ES-EN                   IT-EN                   NL-EN
                      d=100   d=200   d=300   d=100   d=200   d=300   d=100   d=200   d=300
--------------------------------------------------------------------------------------------
BWESG, Merge and Shuffle
  cs:16,MIN           0.607   0.600   0.577   0.585   0.597   0.571   0.293   0.244   0.219
  cs:16,AVG           0.617   0.613   0.596   0.599   0.601   0.583   0.300   0.254   0.224
  cs:16,MAX           0.625   0.630   0.613   0.607   0.606   0.596   0.307   0.267   0.233
  cs:48,MIN           0.658   0.676   0.672   0.662   0.677   0.672   0.378   0.366   0.354
  cs:48,AVG           0.665   0.685   0.688   0.669   0.683   0.683   0.389   0.381   0.363
  cs:48,MAX           0.675   0.694   0.705   0.677   0.692   0.689   0.394   0.395   0.377
BWESG, Length-Ratio
  cs:16               0.627   0.610   0.602   0.613   0.614   0.595   0.303   0.275   0.237
  cs:48               0.678   0.701   0.703   0.679   0.689   0.692   0.397   0.396   0.382
BWESG, No Shuffling
  cs:16               0.218   0.176   0.139   0.209   0.198   0.162   0.070   0.068   0.049
  cs:48               0.511   0.497   0.480   0.523   0.540   0.526   0.214   0.198   0.197
--------------------------------------------------------------------------------------------
BMu                   0.441   0.441   0.441   0.575   0.575   0.575   0.237   0.237   0.237
AMu                   0.518   0.518   0.518   0.618   0.618   0.618   0.236   0.236   0.236
TPPMI                 0.577   0.577   0.577   0.647   0.647   0.647   0.206   0.206   0.206


Table 4: BLE performance in terms of Acc_1 scores for all tested BLE models for Spanish- 
English, Italian-English and Dutch-English with all bilingual word representations 
learned from document-aligned Wikipedia data. For BWESG with merge and 
shuffle we report maximum (MAX), minimum (MIN) and average (AVG) scores 
over 10 random corpora shuffles. Highest scores per column are in bold. 

7.3.2 Experiment I: BWESG vs Group I 

Table 4 shows the first set of results on the BLE task; we report scores with two different 
BWESG strategies as well as with a BWESG model which does not shuffle pseudo-bilingual 
documents. The previous best reported Acc_1 scores with baseline representations for the 
same training+test combination are also reported in the table. Reading the table from 
several angles, we summarize the most important findings. 

BWESG vs Baseline Representations The results clearly reveal the superior performance 
of the BWESG model for BLE, which relies on our new framework for inducing 
bilingual word embeddings from document-aligned comparable data, over other BLE models 
relying on previously used bilingual word representations from the same type of training 
data. The relative increase in Acc_1 scores over the best scoring baseline models is 22.2% 
for ES-EN, 7% for IT-EN and 67.5% for NL-EN. 
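
The reported percentages can be checked with a few lines of arithmetic. The sketch below assumes they are relative increases of the best BWESG Acc_1 score over the best baseline Acc_1 score per language pair, with both values read from Table 4; this reading reproduces the figures quoted above. The function and variable names are illustrative.

```python
def relative_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Best Acc_1 per language pair, as read from Table 4.
best_bwesg = {"ES-EN": 0.705, "IT-EN": 0.692, "NL-EN": 0.397}
best_baseline = {"ES-EN": 0.577, "IT-EN": 0.647, "NL-EN": 0.237}

for pair in best_bwesg:
    gain = relative_increase(best_bwesg[pair], best_baseline[pair])
    print(pair, round(gain, 1))
# ES-EN 22.2
# IT-EN 7.0
# NL-EN 67.5
```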

BWESG Shuffling Strategy Although both BWESG strategies display results that are 
above established baselines, there is a clear advantage to the length-ratio shuffle strategy, 
which displays a solid and robust performance across a variety of parameters and all three 
language pairs. Another advantage of that strategy is the fact that it has a deterministic 
outcome and does not suffer from “sub-optimal” random shuffles. In summary, we suggest 
using the length-ratio shuffle strategy in future work, and along the same line we opt for 
that strategy in all further experiments. 


[Figure 3 panels: (a) Spanish-English, (b) Italian-English, (c) Dutch-English; x axis: window size] 

Figure 3: Acc_1 scores in the BLE task with BWESG length-ratio shuffle for all 3 language 
pairs, and varying values for parameters cs and d. Solid (red) horizontal lines 
denote the highest baseline Acc_1 scores for each language pair. Thicker dotted 
lines refer to BWESG without shuffling. 

The results also reveal that shuffling is universally useful, as BWESG without shuffling 
relies largely on monolingual contexts and cannot reach the performance of BWESG with 
shuffling. A partial remedy for the problem is to train BWESG with more document-level 
training pairs (i.e., by increasing the window size), but that leads to prohibitively 
expensive models, and BWESG without shuffling still falls short of BWESG with either 
shuffling strategy even at larger values of cs (see also Figures 3(a)-3(c)). 

Window Size: Number of Training Pairs The results confirm the intuition that larger 
window sizes, i.e., more training examples, lead to better results in the BLE task. For all 
embedding dimensions d, BWESG performs better with cs = 48 than with 
cs = 16, and the performance with cs = 48 and cs = 60 seems relatively stable; intuitively, 
more training pairs lead to a slightly better BLE performance, but the curve slowly flattens 
out (Figures 3(a)-3(c)). This finding reveals that even a coarse tuning of these parameters 
might lead to optimal or near-optimal scores for BLE with BWESG. 

Differences across Language Pairs A lower increase in Acc_1 scores for IT-EN is 
attributed to the fact that the test set for IT-EN comprises IT words with occurrence 
frequencies above 200 in the training data (Vulic & Moens, 2013a), while the other two test sets 
comprise randomly sampled words covering all frequency spectra. As expected, all models 
in comparison are able to effectively utilize distributional signals for higher-frequency words, 
but BWESG still displays the best performance, and these improvements in Acc_1 scores are 
statistically significant (using McNemar’s statistical significance test, p < 0.05).¹⁵ 
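
For readers unfamiliar with McNemar's test, the sketch below shows one common variant: the chi-square approximation with continuity correction, applied to per-item correctness of two models on the same test set. Whether the experiments used this variant or the exact binomial version is not stated, so this is an assumption; the correctness vectors are made-up data and the function name is illustrative.

```python
import math


def mcnemar(correct_a, correct_b):
    """McNemar's test (chi-square approximation with continuity correction).

    correct_a, correct_b: per-item booleans for two models on the same items.
    Returns (statistic, p_value) with one degree of freedom.
    """
    # Discordant pairs: items on which exactly one of the two models is correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0, 1.0
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / (only_a + only_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical outcomes on 100 test words: model A alone is right on 30 items,
# model B alone on 10, both on 50, neither on 10.
model_a = [True] * 30 + [False] * 10 + [True] * 50 + [False] * 10
model_b = [False] * 30 + [True] * 10 + [True] * 50 + [False] * 10

stat, p_value = mcnemar(model_a, model_b)
print(stat, p_value < 0.05)
```

Only the discordant items influence the statistic, which is why the test suits paired comparisons of Acc_1 scores obtained on the same ground truth pairs.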

Further, the lowest overall scores for all models in comparison are observed for NL-EN. 
We attribute this to using less training data for NL-EN when compared to ES-EN and IT-EN 
(i.e., training corpora for ES-EN and IT-EN are almost triple the size of training corpora 
for NL-EN). However, we observe that the increase obtained by BWESG is even more 
prominent in this setting with limited training data. The lower results of TPPMI compared 
to the other two baseline models are also attributed to the overall lower quality and size of 
NL-EN training data, which is then reflected in a lower quality of seed lexicons necessary 
to start the bootstrapping procedure from Vulic and Moens (2013b). 


15. McNemar’s significance test is very common in the NLP literature, especially when Acc_1 scores are 
reported. It utilizes the standard 2×2 contingency table, and may be viewed as a paired version of the 
more common chi-square test. The reader is referred to the original work (McNemar, 1947). 


Pair:                 ES-EN                   IT-EN                   NL-EN
                      d=100   d=200   d=300   d=100   d=200   d=300   d=100   d=200   d=300
--------------------------------------------------------------------------------------------
BWESG, Length-Ratio
  cs:48               0.678   0.701   0.703   0.679   0.689   0.692   0.397   0.396   0.382
Mikolov
  cs:4                0.187   0.151   0.282   0.368   0.382   0.533   0.042   0.068   0.120
  cs:8                0.305   0.306   0.420   0.462   0.518   0.582   0.076   0.095   0.145
  cs:16               0.344   0.396   0.486   0.472   0.539   0.602   0.117   0.161   0.184
  cs:48               0.311   0.375   0.477   0.458   0.536   0.591   0.132   0.178   0.202
  cs:60               0.324   0.389   0.479   0.460   0.538   0.597   0.151   0.180   0.209
BiCVM
  iterations:200      0.342   0.384   0.403   0.309   0.366   0.377   0.068   0.084   0.083


Table 5: BLE results: Comparison of BWESG with (1) the BWE induction model from 
Mikolov et al. (2013b) relying on SGNS, (2) BiCVM: the BWE induction model 
from Hermann and Blunsom (2014b) initially developed for parallel sentence- 
aligned data. All models were trained on the same document-aligned training 
Wikipedia data with exactly the same vocabularies. 

Computational Complexity BWESG trained with larger values for d and cs yields richer 
semantic representations, but also naturally leads to increased training times. However, due 
to the lightweight design of the underlying SGNS, the times are an order of magnitude 
lower than the training times for Basic-MuPTM or Association-MuPTM. Typically, several 
hours are needed to train BWESG with d = 300 and cs ≈ 48-60, whereas it takes two to 
three days to train a bilingual topic model with K = 2000 on the same training set using 
multi-threaded architectures on 10 Intel(R) Xeon(R) CPU E5-2667 2.90GHz processors. 
The BWESG model scales as expected (i.e., training time increases linearly with the window 
size with all other parameters being equal), and enjoys all the advantages (training time-wise 
and memory-wise) of the original word2vec package. A logical explanation for this behaviour 
follows from the interpretation of SGNS provided by Levy and Goldberg (2014a): using 
a window size of 48 instead of a window size of 16 basically means using 3 times more positive 
examples for training (e.g., approximately 15 minutes are needed to train 300-dimensional 
ES-EN BWESG embeddings with cs = 16 using the Wikipedia data as opposed to 46 
minutes with cs = 48, measured again on 10 Intel(R) Xeon(R) processors). 
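
The linear scaling argument can be illustrated by counting the positive (pivot, context) pairs that a skip-gram model extracts from a document. The sketch below assumes a fixed symmetric window of the given size on each side of the pivot; the actual word2vec implementation samples a smaller effective window per position, which changes the absolute counts but should leave the ratio between two window sizes roughly unchanged. The document length and the function name are hypothetical.

```python
def count_positive_pairs(doc_len, window):
    """Number of (pivot, context) pairs in a document of doc_len tokens.

    Assumes a fixed symmetric window: up to `window` tokens on each side.
    """
    total = 0
    for position in range(doc_len):
        left = min(window, position)                 # context words to the left
        right = min(window, doc_len - 1 - position)  # context words to the right
        total += left + right
    return total


doc_len = 10_000  # hypothetical pseudo-bilingual document length
pairs_16 = count_positive_pairs(doc_len, 16)
pairs_48 = count_positive_pairs(doc_len, 48)
print(pairs_16, pairs_48, pairs_48 / pairs_16)
```

For long documents the ratio approaches 3, in line with the reported timings (46 minutes against 15 minutes, a factor of about 3.1).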

7.3.3 Experiment II: BWESG vs Other BWE Induction Models (Group II) 

All further experiments are conducted using BWESG with the length-ratio shuffle strategy. 
Note that again all models in comparison use exactly the same data sources and vocabularies 
as BWESG and Group I models from the previous section. The results with BiCVM and the 
Mikolov model are summarized in Table 5: the comparison reveals a clear and prominent 
advantage for the BWESG model given the same data and training setup. We do not 
report absolute scores of the BilBOWA model in this setup as they were much lower than 
those of the other two baseline models. The BiCVM model, although in theory fit to learn from 
document-aligned data, is unable to compete with BWESG when learning BWEs from the 
noisier setting with non-parallel data. 


[Figure 4 panels: (a) Spanish-English, (b) Italian-English, (c) Dutch-English] 

Figure 4: Comparison of BWESG (solid curves) with two other models that rely on 
parallel training data: (1) BilBOWA (dotted curves), (2) BiCVM: the BWE induction 
model initially developed for parallel sentence-aligned data (dashed horizontal 
lines). All models were trained on the same sentence-aligned training Europarl 
data with exactly the same vocabularies. BLE is performed over the same search 
space for all models. The x axes are in log scale. 

We also present a preliminary study where we compare BWESG and Group II models in 
the setup with parallel sentence-aligned data. Results are summarized in Figures 4(a)-4(c).¹⁶ 
The preliminary results clearly demonstrate that BWESG is able to learn BWEs from 
parallel data without the slightest change in its modeling principles. While the BilBOWA 
model displays better results for lower values of the cs parameter, to our own surprise, the 
BWESG model is comparable to or even better than the baseline models with larger window 
sizes. The BiCVM model, which implicitly utilizes the entire sentence span in training, 
also outperforms BWESG with smaller windows, but BWESG again performs significantly 
better with larger windows. The BWESG performance flattens out quicker than with the 
Wikipedia data (compare the results with cs = 16 and cs = 48), which is easily explained by 
the decreased length of aligned items as provided in Table 1 (i.e., sentences vs documents). 


16. Note that the absolute scores are not directly comparable to the BLE scores when the model is trained 
on Wikipedia data (Tables 4 and 5) due to different training data, different preprocessing steps and 
vocabularies. Different vocabularies also result in different BLE search spaces and coverages of the test 
sets (e.g., some very common Spanish nouns from the test set such as nadador (swimmer) or colmillo 
(tusk) are not observed in Europarl due to the domain shift). 


For English-Spanish, we can also compare BWESG to pre-trained 40-dimensional 
embeddings from Chandar et al. (2014), as their embeddings were also induced on the same 
Europarl data. While their model’s Acc_1 score is 0.432 for d = 40, BWESG obtains Acc_1 
scores of 0.502 (d = 40, cs = 8), 0.535 (d = 40, cs = 16) or 0.529 (d = 40, cs = 48). 

8. Evaluation Task II: Suggesting Word Translations in Context 

In another task, we test the ability of BWEs to support context-sensitive semantic similarity 
modeling (see Section 5.1), which in turn may be used to solve the task of suggesting word 
translations in context (SWTC) proposed recently (Vulic & Moens, 2014). The goal now 
is to build BWESG-based models for SWTC given the sentential context, similarly to 
the prior work. We show that our new BWESG-based SWTC models outperform the best 
SWTC models (Vulic & Moens, 2014), as well as other SWTC models which rely on the 
baseline word representations discussed in Section 4. 

8.1 Task Description 

Given an occurrence of a polysemous word $w_i \in V^S$ and the context of that occurrence, the 
SWTC task is to choose the correct translation in the target language $L_T$ of that particular 
occurrence of $w_i$ from the given set $TC(w_i) = \{t_1, \ldots, t_{tq}\}$, $TC(w_i) \subseteq V^T$, of its $tq$ possible 
translations/meanings. We may refer to $TC(w_i)$ as an inventory of translation candidates 
for $w_i$. The task of suggesting word translations in context (SWTC) may be interpreted as 
ranking the $tq$ translation candidates with respect to the observed local context $Con(w_i)$ 
of the occurrence of the word $w_i$. The best scoring translation candidate according to the 
scores $sim(w_i, t_j, Con(w_i))$ (see Section 5.1) in the ranked list is then taken as the correct 
translation for that particular occurrence of $w_i$ observing its local context $Con(w_i)$. 
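
The ranking step described above can be sketched as follows. The exact scoring function is defined in Section 5.1 and is not reproduced here, so the sketch makes explicit assumptions: the context representation is the sum of the context word embeddings, the word is modulated as w' = (1 - lam) * w + lam * Con(w), and candidates are scored by cosine similarity. The Spanish word, its candidates and all vectors are hypothetical toy data, and `rank_candidates` is an illustrative name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(word_vec, context_vecs, candidates, lam=1.0):
    """Rank translation candidates for one occurrence of a polysemous word.

    Assumed form (not taken from the source): additive composition of the
    context, then w' = (1 - lam) * w + lam * sum(context vectors).
    """
    context_sum = [sum(column) for column in zip(*context_vecs)]
    modulated = [(1.0 - lam) * w + lam * c for w, c in zip(word_vec, context_sum)]
    scored = sorted(
        ((cosine(modulated, vec), target) for target, vec in candidates.items()),
        reverse=True,
    )
    return [target for _, target in scored]


# Toy example: a hypothetical polysemous word with a financial context.
banco = [0.5, 0.7, 0.1]
context = [[0.9, 0.0, 0.2], [0.8, 0.1, 0.1]]  # vectors of two context words
candidates = {"bank": [0.9, 0.1, 0.1], "bench": [0.1, 0.9, 0.1]}

print(rank_candidates(banco, context, candidates, lam=1.0))  # ['bank', 'bench']
print(rank_candidates(banco, context, candidates, lam=0.0))  # ['bench', 'bank']
```

With lam = 0.0 the context is ignored and the ranking reduces to word-type similarity, while lam = 1.0 relies on the context alone.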

8.2 Experimental Setup 

Test Data We use the SWTC test set introduced recently (Vulic & Moens, 2014). The test 
set comprises 15 polysemous nouns in each of the three languages (ES, IT and NL) along with 
sets of their translation candidates (i.e., sets TC). For each polysemous noun, the test sets provide 
24 sentences extracted from Wikipedia which illustrate different senses and translations of 
the pivot polysemous noun, accompanied by the annotated correct translation for each 
sentence. This yields 360 test sentences for each language pair (and 1080 test sentences in total). 
An additional set of 100 IT sentences (5 other polysemous IT nouns plus 20 sentences for 
each noun) is used as a development set to tune the parameter λ (see Section 5.1) for all 
language pairs and all models in comparison. In summary, the final aim may be formulated 
as follows: For each polysemous word $w_i$ in ES/IT/NL, the goal is to suggest its correct 
translation in English given its sentential context. 

Evaluation Metrics Since the task is to present a list of possible translations to a SWTC 
model, and then let the model decide a single most likely translation given the word and 
its sentential context, we measure the performance again as Top 1 accuracy (Acc_1). 

8.3 Results and Discussion 

8.3.1 Experiment I: BWESG vs Group I 

Note that the Group I models held the previously best reported SWTC scores for the 
training+test combination. 

Models in Comparison (1) BWESG+add. RM: BWESG. SF: cos. Composition: addition. 
λ = 1.0. The value for λ suggests that only context is used to disambiguate the 
meaning of a polysemous word and to guess its most likely translation in context. 

(2) BMu+HD+S. RM: Basic-MuPTM. SF: Hellinger distance. Composition: Smoothed-Fusion¹⁸ 
from (Vulic & Moens, 2014). λ = 0.9. 

(3) BMu+Cue+S. RM: Basic-MuPTM. SF: Cue or Association measure (Steyvers & Griffiths, 
2007; Vulic & Moens, 2013a). Composition: Smoothed-Fusion. λ = 0.9. The 
Cue similarity is tailored for probabilistic models and computed as the association score 
$P(t_j \mid w_i') = \sum_{k=1}^{K} P(t_j \mid z_k) P(z_k \mid w_i')$, where $z_k$ denotes the k-th latent feature, and 
$P(z_k \mid w_i')$ denotes the modulated probability score obtained by smoothing the probabilistic 
representations of $w_i$ and its context $Con(w_i)$. 

(4) TPPMI+add. RM: Traditional-PPMI. SF: cos. Composition: addition. λ = 0.9. 

Again, all parameters of the baseline representation models are adopted directly from 
prior work where they were optimized on development sets comprising additional 100 
sentences (Vulic & Moens, 2014). In addition, BMu+HD+S and BMu+Cue+S also rely on the 
procedure of context sorting and pruning (Vulic & Moens, 2014), where the idea is to retain 
only context words which are most semantically similar to the given pivot polysemous word, 
and then use them in computations. The procedure, however, produces significant gains 
only for probabilistic models (BMu+HD+S and BMu+Cue+S), and therefore, we employ 
it only for these models. BMu+HD+S and BMu+Cue+S with context sorting and pruning 
were the best scoring models in the introductory SWTC paper (Vulic & Moens, 2014) and 
currently produce state-of-the-art SWTC results on these test sets.¹⁹ 

Table 6 summarizes the results and comparison with Group I models on the SWTC task. 
NO-CONTEXT refers to the context-insensitive majority baseline (i.e., always choosing the 
most semantically similar translation candidate obtained by BWESG at the word type level, 
without taking into account any context information). 


17. We have also experimented with the context-sensitive CLSS models proposed by Melamud et al. (2015), 
but we do not report the actual scores as this model, although displaying a similar relative ranking of 
different representation models, was consistently outperformed by the models from Vulic and Moens 
(2014) in our evaluation runs: ≈0.75-0.80 vs ≈0.60-0.65 for the models from Melamud et al. (2015). 

18. In short, Smoothed-Fusion is a probabilistic variant of the context-sensitive modeling idea presented by 
equations (7)-(9). For more details, check (Vulic & Moens, 2014). 

19. We omit results for the Association-MuPTM RM since SWTC models based on Association-MuPTM 
were consistently outperformed by SWTC models based on Basic-MuPTM across different settings. 


Pair:                 ES-EN                   IT-EN                   NL-EN
                      d=100   d=200   d=300   d=100   d=200   d=300   d=100   d=200   d=300
--------------------------------------------------------------------------------------------
BWESG+add, Length-Ratio
  cs:16               0.794*  0.767*  0.752*  0.817*  0.789   0.794   0.778*  0.769*  0.767*
  cs:48               0.752*  0.758*  0.764*  0.814*  0.831*  0.814*  0.797*  0.789*  0.775*
BWESG+add, No Shuffling
  cs:16               0.717   0.717   0.694   0.747   0.728   0.728   0.722   0.686   0.678
  cs:48               0.731   0.692   0.686   0.775   0.778   0.758   0.739   0.733   0.719
--------------------------------------------------------------------------------------------
NO-CONTEXT            0.406   0.406   0.406   0.408   0.408   0.408   0.433   0.433   0.433
BMu+HD+S              0.664   0.664   0.664   0.731   0.731   0.731   0.669   0.669   0.669
BMu+Cue+S             0.703   0.703   0.703   0.761   0.761   0.761   0.712   0.712   0.712
TPPMI+add             0.619   0.619   0.619   0.706   0.706   0.706   0.614   0.614   0.614


Table 6: A comparison of SWTC models for Spanish-English, Italian-English and Dutch- 
English with all bilingual word representations learned from document-aligned 
Wikipedia data. The asterisk (*) denotes statistically significant improvements 
of BWESG+add over the strongest baseline according to McNemar’s statistical 
significance test (p < 0.05). Highest scores per column are in bold. 


BWESG vs Baseline Representations The results reveal that BWESG outperforms 
baseline bilingual word representations from Group I also in the SWTC task. The 
improvements are prominent for all reported values of parameters d and cs, and are often 
statistically significant even when compared to the strongest baseline (which is the 
fine-tuned BMu+Cue+S model with context sorting and pruning for all three language pairs 
from Vulic and Moens (2014)). The increase in Acc_1 scores over the strongest baseline is 
12.9% for ES-EN, 11.9% for IT-EN, and 12.4% for NL-EN. The obtained results surpass 
previous state-of-the-art scores and are currently the best reported results on the SWTC 
datasets when using non-parallel data to learn semantic representations. 

BWESG Shuffling Strategy Although BWESG without shuffling (due to a reduced 
complexity of the SWTC task compared to BLE) already displays encouraging results, there 
is again a clear advantage to the length-ratio shuffle strategy, which displays an excellent 
performance for all three language pairs. Put simply, shuffling is again useful. 

Dimensionality and Number of Training Pairs Unlike in the BLE task, the highest 
Acc_1 scores on average are obtained by using lower-dimensional word embeddings (i.e., 
d = 100). The phenomenon may be attributed to the effect of semantic composition and the 
reduced complexity of the SWTC task compared to the BLE task. First, although enlarging 
the dimensionality of embeddings leads to an increased semantic expressiveness within 
the shared bilingual embedding space, it may be harmful when working with composition 
models, since the simple additive model of semantic composition may produce more 
erroneous dimensions when constructing higher-dimensional context embeddings out of single 
word embeddings. Second, due to its design, the SWTC task requires coarser-grained 
representations than BLE. While in the BLE task the goal is to detect a translation of a word 
from a vocabulary which typically spans (tens of) thousands of words, in the SWTC task 
the goal is to detect the most likely translation of a word given its sentential context, but 
from a small closed vocabulary of 2-4 possible translations from the translation inventory. 
Therefore, it is highly likely that even low-dimensional embeddings are sufficient to 
produce plausible rankings for the SWTC task, while at the same time, they are not sufficient 
and expressive enough to find correct translations in BLE. More training pairs (i.e., larger 
windows) still yield better results on average in the SWTC task. In summary, the choice of 
representation granularity is dependent on the actual task, which consequently leads to the 
conclusion that optimal values for d and cs are largely task-specific (compare also results 
in Table 4 and Table 6). 


Model                 2 senses   3 senses   4 senses
                      Acc_1      Acc_1      Acc_1
----------------------------------------------------
BMu+Cue+S             0.827      0.619      0.417
BWESG+add             0.834      0.804      0.583


Table 7: A comparison of the best scoring baseline model BMu+Cue+S and the best scoring 
BWESG+add model over different clusters of words (2-sense, 3-sense and 4-sense 
words) for Spanish-English. 

Testing Polysemy In order to test whether the gain in performance for BWESG+add 
is derived mostly from the effective handling of the easiest set of words, that is, bisemous 
words (polysemous words with only 2 translation candidates), we have performed an 
additional experiment, where we have measured Acc_1 scores separately for words with 2, 3, 
and 4 different senses. Results indicate that the performance gain comes mostly from gains 
on trisemous and tetrasemous words, while the scores on bisemous words are comparable. 
Table 7 shows Acc_1 over different clusters of words for ES-EN, and similar scoring patterns 
are observed for IT-EN and NL-EN. 
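
A breakdown of this kind can be obtained by grouping the per-sentence outcomes by the number of translation candidates of the pivot word. The sketch below is an illustration with made-up outcomes rather than the test data; the function name and the input format are assumptions.

```python
from collections import defaultdict


def accuracy_by_sense_count(results):
    """Acc_1 per sense-count cluster.

    results: iterable of (num_senses, is_correct) pairs, one per test sentence.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for num_senses, is_correct in results:
        totals[num_senses] += 1
        hits[num_senses] += int(is_correct)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Hypothetical per-sentence outcomes (number of senses, prediction correct?).
toy_results = [
    (2, True), (2, True), (2, False), (2, True),
    (3, True), (3, False),
    (4, False), (4, True),
]
print(accuracy_by_sense_count(toy_results))  # {2: 0.75, 3: 0.5, 4: 0.5}
```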

Differences across Language Pairs Due to the reduced complexity of SWTC, we may 
also observe relatively higher results for NL-EN when compared to ES-EN and IT-EN, as 
opposed to their relative performance in the BLE task, where the scores for NL-EN are 
much lower than scores for ES-EN and IT-EN. Since SWTC is a less difficult task which 
requires coarse-grained representations, even limited amounts of training data may be 
sufficient to learn word embeddings which are useful for the specific task. This finding is in 
line with the recent work from Gouws and Søgaard (2015). 

8.3.2 Experiment II: BWESG vs. Other BWE Induction Models (Group II) 

We again test other BWE induction models in the SWTC task, using the same training setup 
and sets of embeddings as introduced in Section 7.3.3 for the BLE task. The representations 
were now plugged into the context-sensitive CLSS modeling framework from Section 5.1, and 
the optimization of parameters for SWTC has been conducted in the same manner as for 
BWESG. The results with the Mikolov model and BiCVM are summarized in Table 8. 
The results with BilBOWA are very similar to those with BiCVM, so we do not report them 
for brevity. 

Pair:                                    ES-EN                   IT-EN                   NL-EN
Model                    Setting         d=100   d=200   d=300   d=100   d=200   d=300   d=100   d=200   d=300

BWESG+add, Length-Ratio  cs:16           0.794   0.767   0.752   0.817   0.789   0.794   0.778   0.769   0.767
                         cs:48           0.752   0.758   0.764   0.814   0.831   0.814   0.797   0.789   0.775

Mikolov                  cs:4            0.742   0.739   0.725   0.733   0.706   0.692   0.692   0.700   0.700
                         cs:8            0.767   0.750   0.747   0.767   0.747   0.744   0.694   0.697   0.672
                         cs:16           0.769   0.744   0.747   0.758   0.755   0.758   0.725   0.700   0.689
                         cs:48           0.678   0.642   0.669   0.714   0.714   0.747   0.725   0.711   0.708
                         cs:60           0.636   0.658   0.656   0.725   0.725   0.742   0.722   0.728   0.722

BiCVM                    iterations:200  0.547   0.567   0.539   0.636   0.664   0.642   0.586   0.567   0.581

Table 8: SWTC results: Comparison of BWESG with (1) the BWE induction model from 
Mikolov et al. (2013b) relying on SGNS, (2) BiCVM: the BWE induction model 
from Hermann and Blunsom (2014b) initially developed for parallel sentence-aligned 
data. All models were trained on the same document-aligned training 
Wikipedia data with exactly the same vocabularies. 

BWESG outperforms the other BWE induction models in the SWTC task, which further 
confirms its utility in cross-lingual semantic modeling. The model from Mikolov et al. 
(2013b) constitutes a stronger baseline: its good results in the SWTC task 
are an interesting finding per se. While the model is not competitive with BWESG and other 
baseline representation models from document-aligned data in the more difficult BLE task 
when using noisy one-to-one translation pairs, its performance on the less complex SWTC 
task with a reduced search space is solid even when the model relies on an imperfect set 
of translation pairs to learn the mapping between two monolingual embedding spaces. 
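
To make the notion of a mapping between two monolingual embedding spaces concrete, the 
following minimal sketch learns a linear map W from a handful of seed translation pairs by 
stochastic gradient descent on the squared error between the mapped source vector and its 
target vector. The function names, learning rate, number of epochs and toy vectors are 
illustrative assumptions and do not reproduce the setup used in our experiments. 

```python
def map_vector(x, W):
    """Project a source-space vector x into the target space: returns x W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def learn_mapping(pairs, dim_src, dim_tgt, lr=0.1, epochs=200):
    """Learn W minimising sum ||x W - z||^2 over seed pairs (x, z) with plain SGD.

    pairs: list of (source_vector, target_vector) for (possibly noisy) translation pairs.
    lr and epochs are hypothetical defaults chosen for this toy example only.
    """
    W = [[0.0] * dim_tgt for _ in range(dim_src)]
    for _ in range(epochs):
        for x, z in pairs:
            pred = map_vector(x, W)
            err = [pred[j] - z[j] for j in range(dim_tgt)]
            # Gradient of ||x W - z||^2 with respect to W[i][j] is 2 * x[i] * err[j].
            for i in range(dim_src):
                for j in range(dim_tgt):
                    W[i][j] -= lr * 2.0 * x[i] * err[j]
    return W


# Toy seed lexicon: two source vectors with their (assumed) target-space translations.
seed_pairs = [([1.0, 0.0], [0.0, 2.0]), ([0.0, 1.0], [3.0, 0.0])]
W = learn_mapping(seed_pairs, dim_src=2, dim_tgt=2)
mapped = map_vector([1.0, 1.0], W)  # an unseen source vector projected into target space
```

Once a source word is projected in this way, its translation candidates can be ranked by 
similarity in the target space. Noise in the seed pairs directly affects W, which is 
consistent with the weaker behaviour of this baseline in the BLE task. 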


8.3.3 Further Discussion 

By analyzing the influence of pre-training shuffling on the results in two different evaluation 
tasks, we may safely establish its utility when inducing bilingual word embeddings using the 
BWESG model. While we have already presented two shuffling strategies in this work, one 
line of future work will investigate different possibilities of “blending in” words from two 
different vocabularies into pseudo-bilingual documents in a more structured and systematic 
manner. For instance, one approach to generating pseudo-training sentences for learning 
from textual and perceptual modalities has been recently introduced (Hill & Korhonen, 
2014). However, it is not straightforward how to extend this approach to the generation of 
pseudo-bilingual training documents. 

Another idea in the same vein is to build artificial training data of higher quality starting 
from noisy comparable data by: (1) computing semantically similar words monolingually 
and across languages from the noisy data, (2) retaining only highly reliable pairs of similar 
words using an automatic selection procedure (Vulic & Moens, 2012), (3) building 
pseudo-bilingual documents using only reliable context word pairs. In other words, the question is: 
Is it possible to choose positive training pairs more systematically to reduce the noise 
stemming from non-parallel data? The construction of such artificial training data and training 
on such data would then proceed in a bootstrapping fashion, and the model should be able 
to steadily reduce the noise inherently present in comparable data. The idea of “improving 
corpus comparability” was only touched upon in previous work (Li & Gaussier, 2010; Li, 
Gaussier, & Aizawa, 2011). 
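
One simple way to picture step (2) is a symmetry check: a cross-lingual pair is retained 
only if each word is the nearest neighbour of the other in the shared embedding space. The 
sketch below illustrates this idea under our own assumptions; the mutual-nearest-neighbour 
criterion, the `min_sim` threshold and the toy vectors are hypothetical and are not 
claimed to match the selection procedure of Vulic and Moens (2012). 

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query_vec, candidates):
    """Return the word in `candidates` (word -> vector) most similar to query_vec."""
    return max(candidates, key=lambda w: cosine(query_vec, candidates[w]))


def reliable_pairs(src_vecs, tgt_vecs, min_sim=0.0):
    """Keep (source, target) pairs that are mutual nearest neighbours.

    min_sim is an assumed extra similarity threshold for discarding weak pairs.
    """
    pairs = []
    for s, s_vec in src_vecs.items():
        t = nearest(s_vec, tgt_vecs)
        back = nearest(tgt_vecs[t], src_vecs)
        if back == s and cosine(s_vec, tgt_vecs[t]) >= min_sim:
            pairs.append((s, t))
    return pairs


# Toy shared space: "ruido" is close to "house" but loses the symmetry check to "casa".
src = {"casa": [1.0, 0.0], "perro": [0.0, 1.0], "ruido": [0.9, 0.1]}
tgt = {"house": [1.0, 0.0], "dog": [0.0, 1.0]}
kept = reliable_pairs(src, tgt)
```

In a bootstrapping setting, the retained pairs would be used to rebuild the pseudo-bilingual 
training documents, after which the embeddings and the set of reliable pairs could be 
re-estimated in turn. 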

While the entire framework proposed in this article is in theory completely language 
pair agnostic as it does not make any language pair dependent modeling assumptions, we 
acknowledge the fact that all three language pairs comprise languages coming from the same 
phylum, that is, the Indo-European language family. Future extensions also include porting 
the framework to other more distant language pairs that share neither the same roots nor the 
same alphabet (e.g., English-Chinese/Hindi/Arabic), and for which benchmarking test sets 
are still scarce for a variety of semantic tasks (e.g., SWTC) (Camacho-Collados, Pilehvar, 
& Navigli, 2015). We believe that larger window sizes may alleviate difficulties with different 
word orderings (e.g., for Chinese-English). 

9. Conclusions and Future Work 

We have proposed and described Bilingual Word Embeddings Skip-Gram (BWESG), a simple 
yet effective bilingual word representation learning model which is able to induce bilingual 
word embeddings solely on the basis of document-aligned comparable data. BWESG 
is based on the omnipresent skip-gram with negative sampling (SGNS). We have presented 
two ways to build pseudo-bilingual documents on which a monolingual SGNS (or any 
monolingual WE induction model) may be trained to produce shared bilingual embedding 
spaces. The BWESG model neither makes any language-pair dependent assumptions nor 
requires language-pair specific external resources such as bilingual lexicons, predefined 
category/ontology knowledge or parallel data. We have shown that the model may be trained 
on non-parallel and parallel data without any changes in modeling principles, which, 
complemented with its simplicity and lightweight design, makes it potentially very useful as a 
tool for researchers in machine translation and information retrieval. 
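
As an illustration of how a pseudo-bilingual document can be built from one aligned document 
pair, the sketch below interleaves the words of the two documents in proportion to their 
lengths, in the spirit of the length-ratio shuffling strategy. The function name and the 
exact interleaving rule (a block of words from the longer document followed by one word 
from the shorter one, with leftover words appended) are our assumptions for this example 
and may differ in detail from the strategy used in the experiments. 

```python
def length_ratio_merge(src_doc, tgt_doc):
    """Interleave two aligned documents (lists of words) by their length ratio.

    With ratio R = floor(len(longer) / len(shorter)), the merged document contains
    R words of the longer document followed by one word of the shorter document,
    repeated until the shorter document is exhausted; leftover words are appended.
    """
    if not src_doc or not tgt_doc:
        return list(src_doc) + list(tgt_doc)
    if len(src_doc) >= len(tgt_doc):
        longer, shorter = src_doc, tgt_doc
    else:
        longer, shorter = tgt_doc, src_doc
    ratio = len(longer) // len(shorter)
    merged, pos = [], 0
    for word in shorter:
        merged.extend(longer[pos:pos + ratio])  # block from the longer document
        merged.append(word)                     # one word from the shorter document
        pos += ratio
    merged.extend(longer[pos:])                 # remaining words of the longer document
    return merged


source_words = ["s1", "s2", "s3", "s4", "s5"]
target_words = ["t1", "t2"]
pseudo_doc = length_ratio_merge(source_words, target_words)
# pseudo_doc == ["s1", "s2", "t1", "s3", "s4", "t2", "s5"]
```

The merged word sequences can then be passed unchanged to any monolingual SGNS 
implementation, so that context windows contain words from both languages. 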

We have employed induced BWEs in two semantic tasks: (1) bilingual lexicon extraction 
(BLE), and (2) suggesting word translations in context (SWTC). Our new BWESG-based 
BLE and SWTC models outperform previous state-of-the-art models for BLE and SWTC 
from document-aligned comparable data and related BWE induction models (Mikolov et al., 
2013b; Chandar et al., 2014; Gouws et al., 2015). The findings in this article are in line with the 
recently published surveys from Baroni et al. (2014) and Levy et al. (2015) regarding the solid and 
robust performance of neural word representations/word embeddings in semantic tasks: our 
new BWESG-based models for BLE and SWTC significantly outscore previous state-of-the-art 
distributional approaches on both tasks across different parameter settings. Even more 
encouraging is the fact that these new state-of-the-art results are attained using default 
parameter settings for the BWESG model as suggested in the word2vec package without 
any development set. Further (finer) tuning of model parameters in future work may lead 
to higher-quality bilingual embedding spaces. 

Several straightforward lines of future research have already been tackled in Sections 7 
and 8. For instance, the current length-ratio shuffling strategy may be replaced by a more 
advanced shuffling method in future work. Moreover, BWEs induced by BWESG may 
be used in other semantic tasks besides the ones discussed in this work, and it would be 
interesting to experiment with other types of context aggregation and selection beyond the 
bag-of-words assumption, such as dependency-based contexts (Levy & Goldberg, 2014a), or 
other objective functions during training in the same vein as proposed by Levy and Goldberg 
(2014b). Similar to the evolution in multilingual probabilistic topic modeling, another path 
of future work may lead to investigating bilingual models for learning BWEs which will be 
able to jointly learn from separate documents in aligned document pairs, without the need 
to construct pseudo-bilingual documents. 
A natural step in the text representation learning research is to extend the focus from 
single word representations to composite phrase, sentence and document representations 
(Hermann & Blunsom, 2013; Kalchbrenner, Grefenstette, & Blunsom, 2014; Le & Mikolov, 
2014; Soyer et al., 2015). In this article, we have relied on a simple composition model based 
on vector addition, and have shown that this model performs excellently in the SWTC task. 
However, in the long run this model is not by any means sufficient to effectively capture 
all complex compositional phenomena in the data. Several models which aim to learn 
sentence and document embeddings have been proposed recently, but they critically rely 
on sentence-aligned parallel data. It is yet to be seen how to build structured multilingual 
phrase, sentence and document embeddings solely on the basis of comparable data. Such 
low-cost multilingual embeddings beyond the word level extracted from comparable data 
may find application in a variety of tasks such as statistical machine translation (Mikolov 
et al., 2013b; Zou et al., 2013; Zhang et al., 2014; Wu et al., 2014), semantic tasks such 
as multilingual semantic textual similarity (Agirre, Banea, Cardie, Cer, Diab, 
Gonzalez-Agirre, Guo, Mihalcea, Rigau, & Wiebe, 2014), cross-lingual information retrieval (Vulic 
et al., 2013; Vulic & Moens, 2015) or cross-lingual document classification (Klementiev 
et al., 2012; Hermann & Blunsom, 2014b; Chandar et al., 2014). 
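
The additive composition model mentioned above can be sketched as follows: the vector of a 
polysemous word is summed with the vectors of its sentential context words, and candidate 
translations are ranked by cosine similarity to the resulting vector in the shared bilingual 
space. The function names and the toy two-dimensional embeddings are illustrative 
assumptions; the sketch leaves out the context selection and parameter tuning used in the 
actual SWTC models. 

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def add_compose(word, context, emb):
    """Context-sensitive vector: the word vector plus the vectors of its context words."""
    total = list(emb[word])
    for ctx in context:
        if ctx in emb:  # out-of-vocabulary context words are skipped
            total = [a + b for a, b in zip(total, emb[ctx])]
    return total


def best_translation(word, context, candidates, emb):
    """Pick the candidate translation closest to the composed vector."""
    query = add_compose(word, context, emb)
    return max(candidates, key=lambda t: cosine(query, emb[t]))


# Toy shared space: Spanish "banco" can mean "bank" or "bench".
emb = {
    "banco": [1.0, 1.0], "dinero": [1.0, 0.0], "parque": [0.0, 1.0],
    "bank": [1.0, 0.1], "bench": [0.1, 1.0],
}
choice_money = best_translation("banco", ["dinero"], ["bank", "bench"], emb)
choice_park = best_translation("banco", ["parque"], ["bank", "bench"], emb)
```

Because addition ignores word order and syntactic structure, such a model cannot distinguish 
contexts that contain the same words in different configurations, which is one reason why 
richer composition models are needed in the long run. 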

In another future research path, we may use the knowledge of BWEs obtained by 
BWESG from document-aligned data to learn bilingual correspondences (e.g., word 
translation pairs or lists of semantically similar words across languages) which may in turn be 
used for learning from large unaligned multilingual datasets (Mikolov et al., 2013b; Al-Rfou, 
Perozzi, & Skiena, 2013). In the long run, this idea may lead to large-scale learning 
models from huge amounts of multilingual data without any requirement for parallel data or 
manually built bilingual lexicons. 